[BioC] limma, analysis on subset of data gives completely different results

Thu Jul 17 17:21:43 CEST 2014

Hi Arvid,

On 7/17/2014 9:59 AM, Arvid Sondén wrote:
> Dear all,
>
> I am currently working with gene expression analysis in limma. I have a total of 146 samples divided into 21 groups. What I want to do is pairwise comparisons between one group (the control group) and the others. The following code shows this for the first pairwise comparison between group B and the control group, also adding batch effects to the model. All groups are included in the "Group" variable.
>
> design <- model.matrix(~0+Group+Batch)
> fit <- lmFit(y$E, design)
> cont <- makeContrasts("GroupB-GroupControl", levels=design)
> fit <- contrasts.fit(fit, cont)
> fit <- eBayes(fit)
> tt <- topTable(fit, adjust="BH", coef="GroupB-GroupControl", genelist=y$genes, number=Inf)
>
> From the beginning I was only working with this first comparison, and was only using the data from group B and the control group. Now I have extended this to all the data and all the pairwise comparisons. Since I am using all of the data in the lmFit function, the fit is different from before, when I was only using a part of the data. What confuses me is that the difference is quite large: I now have 1378 significant genes, compared to 203 before, for the GroupB-GroupControl comparison after the BH correction.
>
> Is there a possible limma-specific explanation for this? I have read the documentation on the functions, and the limma user's guide, but I can't say that I have fully understood what is going on inside the lmFit function. On a more conceptual level I understand that the linear model will change when I add new data and new variables, but it seems too large a change to my eyes, since the actual comparison is still the same.
>

The eBayes() step may have some effect, since you are using a larger
number of samples and the prior would hypothetically be estimated more
accurately. But I would imagine that has little to do with it: eBayes()
is primarily intended to stabilize the variance estimates when you have
very few replicates, and once you get past a certain number of samples
I would expect diminishing returns.
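
To put a rough shape on that diminishing return: in limma's moderated
t-statistic, the posterior variance for a gene is a weighted average of
the prior variance (weight d0, the prior degrees of freedom) and the
gene's own residual variance (weight d, the residual degrees of
freedom). The sketch below uses made-up values for d0, the prior
variance, and the gene variance; in a real analysis eBayes() estimates
the first two from the data. The helper name moderated_var is
hypothetical and not part of limma.

```r
# Assumed values for illustration only; eBayes() estimates these from
# the data and returns them as fit$df.prior and fit$s2.prior.
d0  <- 4      # prior degrees of freedom (assumed)
s02 <- 0.05   # prior variance (assumed)

# Posterior (moderated) variance: weighted average of the prior variance
# and the gene-wise residual variance. Hypothetical helper, not limma API.
moderated_var <- function(s2, d, d0, s02) {
  (d0 * s02 + d * s2) / (d0 + d)
}

s2 <- 0.20                 # residual variance of one gene (assumed)
d  <- c(2, 10, 50, 120)    # residual df, from a small subset up to a large model

prior_weight <- d0 / (d0 + d)              # share of the estimate due to the prior
post_var     <- moderated_var(s2, d, d0, s02)

print(data.frame(residual_df   = d,
                 prior_weight  = round(prior_weight, 3),
                 moderated_var = round(post_var, 4)))
```

As the residual degrees of freedom grow, the prior's share shrinks and
the moderated variance approaches the gene's own estimate, which is why
the shrinkage matters most with few replicates.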

On the other hand, you are increasing your residual degrees of freedom
markedly by including all the other groups, so your variance estimates
will be much more precise. You are now borrowing information from all
groups to estimate the within-group variance, which is how ANOVA works
and isn't something specific to limma. I would think this has the
largest effect on your results.
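
A minimal base-R sketch of that point, for a single simulated gene. The
group sizes, the group labels other than B and Control, and the omission
of the Batch term are all assumptions made to keep the example short;
only the totals (146 samples, 21 groups) come from the question. Pooling
the variance this way also assumes the within-group variance is similar
across groups.

```r
set.seed(1)

# Assumed layout: 21 groups and 146 samples (20 groups of 7, one of 6).
# The Batch term from the original model is left out for simplicity.
sizes  <- c(rep(7, 20), 6)
labels <- c("Control", LETTERS[2:21])   # Control, B, C, ..., U
dat <- data.frame(
  Group = factor(rep(labels, times = sizes), levels = labels)
)
dat$expr <- rnorm(nrow(dat))            # one simulated gene, equal variance

# Fit on all groups, and on the B + Control subset only
full    <- lm(expr ~ 0 + Group, data = dat)
sub_dat <- droplevels(dat[dat$Group %in% c("Control", "B"), ])
sub     <- lm(expr ~ 0 + Group, data = sub_dat)

# The estimated B - Control difference is the same in both fits ...
diff_full <- unname(coef(full)["GroupB"] - coef(full)["GroupControl"])
diff_sub  <- unname(coef(sub)["GroupB"]  - coef(sub)["GroupControl"])

# ... but the residual variance is pooled over far more samples
df_full <- df.residual(full)   # 146 samples - 21 coefficients
df_sub  <- df.residual(sub)    # 14 samples - 2 coefficients

print(c(diff_full = diff_full, diff_sub = diff_sub))
print(c(df_full = df_full, df_sub = df_sub))
print(c(sigma_full = summary(full)$sigma, sigma_sub = summary(sub)$sigma))
```

The contrast estimate itself does not change; what changes is the
variance estimate it is tested against and the degrees of freedom of
the reference t-distribution.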

Best,

Jim

> Best regards,
>
> Arvid

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099