[BioC] P values on Log or Non-Log Values
Gordon Smyth
smyth at wehi.edu.au
Tue May 6 12:01:46 MEST 2003
At 01:36 AM 6/05/2003, James MacDonald wrote:
>From a theoretical standpoint it is more correct to do t-tests on logged
>data because one of the assumptions of the t-test is that the underlying
>data are normally distributed. Microarray expression values are almost
>always strongly right-skewed, and logging causes the distribution to
>become much more symmetrical.
>
>It is doubtful that the logged data are normally distributed, but the
>t-test is fairly robust to violations of the normality assumption as long
>as the data are relatively symmetrical.
Don't forget that results on the robustness of the t-test to non-normality
assume that (i) there are a reasonable number of observations, at least 15
say, and (ii) the p-values which need to be accurate are those around 0.05
rather than around 1e-5. Neither of these assumptions is true in the
microarray context!
But the main point here is, as Jim says, it has to be a whole lot better on
the log-scale because the log-intensities are more symmetrically distributed.
Cheers
Gordon
>You can also permute your data to estimate the null distribution if you
>want to remove the reliance on normality. However, in my opinion it is
>still better to use symmetrical (logged) data when permuting.
>
>HTH,
>
>Jim
>
>
>James W. MacDonald
>UMCCC Microarray Core Facility
>1500 E. Medical Center Drive
>7410 CCGC
>Ann Arbor MI 48109
>734-647-5623
>
> >>> "Park, Richard" <Richard.Park at joslin.harvard.edu> 05/05/03 10:34AM >>>
>Hi Everyone,
>I am currently using mt.teststat to calculate p-values between various
>samples. I was wondering if anyone knew whether it is OK to calculate
>p-values on logged or non-logged values? In the past, using MAS processing,
>I always calculated p-values on the raw values; however, I have recently
>switched to processing CEL files through rma, and the data produced by this
>processing are on the log base 2 scale.
>
>My lab has noticed that the log transformation has little visible effect on
>high p-values (above 0.1), but spreads them all over the place in the low
>(significant!) range. Running a t.test on logged values greatly enhances
>the significance (up to 100-fold, compared to running on the raw values)
>when significance derives from tight distributions, but has very little or
>no effect when significance derives from more distant means.
>
>Anyone have any ideas on which method is correct?
>
>thanks,
>Richard Park
>Computational Data Analyzer
>Joslin Diabetes Center