[BioC] P values on Log or Non-Log Values

Tue May 6 12:01:46 MEST 2003

At 01:36 AM 6/05/2003, James MacDonald wrote:
>From a theoretical standpoint it is more correct to do t-tests on logged 
>data because one of the assumptions of the t-test is that the underlying 
>data are normally distributed. Microarray expression values are almost 
>always strongly right-skewed, and logging causes the distribution to 
>become much more symmetrical.
>
>It is doubtful that the logged data are normally distributed, but the 
>t-test is fairly robust to violations of the normality assumption as long 
>as the data are relatively symmetrical.

Don't forget that results on the robustness of the t-test to non-normality 
assume that (i) there are a reasonable number of observations, at least 15 
say, and (ii) the p-values which need to be accurate are those around 0.05 
rather than around 1e-5. Neither of these assumptions is true in the 
microarray context!

But the main point here is, as Jim says, it has to be a whole lot better on 
the log-scale because the log-intensities are more symmetrically distributed.
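
To illustrate the symmetry point, here is a minimal sketch in Python (standard library only). It uses simulated log-normal values as a stand-in for raw intensities; the simulated data, the seed, the distribution parameters, and the helper name sample_skewness are all assumptions made for illustration, not anything taken from real arrays or from the posts in this thread. It simply compares the sample skewness before and after taking log2.

```python
import math
import random
import statistics


def sample_skewness(values):
    """Moment-based sample skewness m3 / m2**1.5 (near 0 for a symmetric sample)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2003)  # fixed seed so the run is reproducible

# Hypothetical "raw intensities": simulated log-normal values, NOT real array data.
# By construction their logarithms are normally distributed.
raw = [rng.lognormvariate(7.0, 1.0) for _ in range(5000)]
logged = [math.log2(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_log = sample_skewness(logged)

print(f"skewness on the raw scale:  {skew_raw:.2f}")
print(f"skewness on the log2 scale: {skew_log:.2f}")
```

On simulated data of this kind the raw-scale skewness is strongly positive and the log2-scale skewness is close to zero. Real log-intensities will not be exactly normal, so this only shows the direction of the effect, not its size.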

Cheers
Gordon

>You can also permute your data to estimate the null distribution if you 
>want to remove the reliance on normality. However, in my opinion it is 
>still better to use symmetrical (logged) data when permuting.
>
>HTH,
>
>Jim
>
>
>James W. MacDonald
>UMCCC Microarray Core Facility
>1500 E. Medical Center Drive
>7410 CCGC
>Ann Arbor MI 48109
>734-647-5623
>
> >>> "Park, Richard" <Richard.Park at joslin.harvard.edu> 05/05/03 10:34AM >>>
>Hi Everyone,
>I am currently using mt.teststat to calculate p-values between various 
>samples. I was wondering if anyone knew whether it is OK to calculate 
>p-values on logged or non-logged values? In the past, using MAS processing, 
>I always calculated p-values on the raw values; however, I have recently 
>switched to processing CEL files through rma, and the data produced by this 
>processing are on the log base 2 scale.
>
>My lab has noticed that the log transformation has little visible effect on 
>high p-values (above 0.1), but spreads them all over the place in the low 
>(significant!) range. Running a t.test on logged values greatly 
>enhances the significance (up to 100-fold, compared to running on unlogged 
>values) when significance derives from tight distributions, but has very 
>little or no effect when significance derives from more distant means.
>
>Anyone have any ideas on which method is correct?
>
>thanks,
>Richard Park
>Computational Data Analyzer
>Joslin Diabetes Center