[BioC] limma, subsets of design matrix and p-values

Sun Mar 13 19:24:39 CET 2005

Analysis of variance (ANOVA) and related methods such as SAM and eBayes in 
Limma produce p-values that depend on the entire data array, not just the 
treatments of interest.

There are 3 main reasons for this.  In the discussion below "t-value" 
refers to the ordinary t-value for the contrast (for ANOVA), a rank version 
(for tests like the Wilcoxon), permutation versions (for SAM and other 
permutation routines) and (e)Bayesian versions for routines like Limma eBayes.

All the available data are used to estimate the "within variance" which is 
the denominator of the test statistic so the "t-value" (or equivalent) 
depends on all the data, not just the 2 treatments being compared
The degrees of freedom of the test are determined by the data used to 
estimate the "within variance", so even if the "t-value" is the same, the 
p-value will not be.
In some cases, multiple comparisons adjustments are built-in to the 
procedures, and these become more stringent as the number of treatments 
increases.

Any good intro text on ANOVA will elucidate these issues.

--Naomi

At 02:15 PM 3/10/2005, Mike Schaffer wrote:
>I've run limma for a few months and had a question about the p-values 
>being calculated from a large set of data vs. just a subset.
>
>
>All of my 2-color array data is relative to an untreated sample and each 
>is replicated three times.
>For example:
>         Treatment1 vs untreated x 3,
>         Treatment2 vs. untreated x 3,
>         Treatment3 vs. untreated x 3
>         ...etc.
>
>So I have a large MA object of all the data that is normalized within 
>arrays and a design matrix created by:
>
>design <- modelMatrix(targets,ref="untreated")
>
>
>My confusion stems from the fact that if I run eBayes on the entire 
>dataset (code below), I get different p-values (but same M values) for the 
>TreatmentX vs. untreated, than if I fit the subset that only includes data 
>for one treatment (e.g. just Treatment 1 vs. untreated).
>
>For example,
>
>fit <- lmFit(MA,design=design)
>eb <- eBayes(fit)
>eb$p.values[1:10,1]
>
>
>gives different p-values than if I were to only initially subset on the 
>Treatment1 vs. untreated data, and then run lmFit.
>For example,
>
>design <- modelMatrix(targets[1:3,],ref="untreated")
>fit <- lmFit(MA[,1:3],design=design)       # the first three data sets 
>include ALLof the Treatment1 vs. untreated data
>eb <- eBayes(fit)
>eb$p.values[1:10]
>
>
>Am I incorrect to assume that the p-values should be the same regardless 
>of how much data is included in the MA object, as long as the design 
>matrix has no overlap between experiments (e.g. treatment1 vs. untreated 
>data is distinct from treatment2 vs. untreated data) -- aside from the 
>fact they are all relative to an untreated sample?
>
>Or is the moderated t-statistic based on ALL of the data in the MA object 
>regardless of each experiment's relationship to others?
>
>I'd like to read in all the data and keep it in one large RG and MA 
>object.  If I'm looking to determine the p-values for genes induced 
>relative to untreated by just one of the treatments, should I use lmFit on 
>the large MA object or should it be subset first (e.g MA[,1:3]) before 
>doing the linear fit?
>
>Am I missing something?
>
>Thanks, in advance, for any help.
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111