[BioC] limma, analysis on subset of data gives completely different results

Thu Jul 17 15:59:28 CEST 2014

Dear all,

I am currently working with gene expression analysis in limma. I have a total of 146 samples divided into 21 groups. What I want to do is pairwise comparisons between one group (the control group) and the others. The following code shows this for the first pairwise comparison between group B and the control group, also adding batch effects to the model. All groups are included in the "Group" variable.

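library(limma)
# Group means (no intercept) plus additive batch effects
design <- model.matrix(~0+Group+Batch)
fit <- lmFit(y$E, design)
# Pass the contrast via 'contrasts=' so the column name is exactly "GroupB-GroupControl"
cont <- makeContrasts(contrasts="GroupB-GroupControl", levels=design)
fit <- contrasts.fit(fit, cont)
fit <- eBayes(fit)
# 'coef' must match the contrast column name exactly (no stray spaces)
tt <- topTable(fit, adjust.method="BH", coef="GroupB-GroupControl", genelist=y$genes, number=Inf)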

Initially I worked only with this first comparison, using only the data from group B and the control group. I have now extended the analysis to all of the data and all of the pairwise comparisons. Because lmFit now uses all samples, the fit differs from the earlier one, which was based on only part of the data. What confuses me is how large the difference is: for the GroupB-GroupControl comparison I now get 1378 significant genes after BH correction, compared to 203 before.

Is there a limma-specific explanation for this? I have read the documentation for the functions and the limma User's Guide, but I can't say that I fully understand what is going on inside the lmFit function. On a more conceptual level, I understand that the linear model changes when I add new data and new variables, but this seems too large a change to me, since the actual comparison is still the same.
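
To make the conceptual point concrete, here is a minimal sketch on simulated data. It uses plain lm() on a single made-up gene rather than limma, leaves out the batch term, and the group labels and sample sizes are hypothetical, so it only illustrates the general idea and says nothing about my actual numbers. It fits the same B-versus-control comparison once with all groups in the model and once with only the two groups of interest.

```r
# Toy illustration with base R only (simulated data, not my real experiment).
set.seed(1)
groups <- c("Control", "B", "C", "D")   # hypothetical group labels
n_per_group <- 7                        # hypothetical group size
Group <- factor(rep(groups, each = n_per_group), levels = groups)

# One simulated gene: group B is shifted by 1 relative to all other groups
expr <- rnorm(length(Group), mean = ifelse(Group == "B", 1, 0), sd = 1)

# Fit 1: all groups in the model (no batch term in this sketch)
fit_full <- lm(expr ~ 0 + Group)

# Fit 2: only the samples from group B and the control group
keep <- Group %in% c("B", "Control")
Group_sub <- droplevels(Group[keep])
expr_sub <- expr[keep]
fit_sub <- lm(expr_sub ~ 0 + Group_sub)

# Estimated B - Control difference from each fit
diff_full <- unname(coef(fit_full)["GroupB"] - coef(fit_full)["GroupControl"])
diff_sub <- unname(coef(fit_sub)["Group_subB"] - coef(fit_sub)["Group_subControl"])

# Residual degrees of freedom and residual standard deviation from each fit
df_full <- df.residual(fit_full)
df_sub <- df.residual(fit_sub)
sigma_full <- summary(fit_full)$sigma
sigma_sub <- summary(fit_sub)$sigma

print(c(diff_full = diff_full, diff_sub = diff_sub))
print(c(df_full = df_full, df_sub = df_sub))
print(c(sigma_full = sigma_full, sigma_sub = sigma_sub))
```

In this one-way layout the estimated difference is the same in both fits, because each coefficient is just a group mean. What changes is the residual standard deviation, which is pooled over all groups in the full fit, and the residual degrees of freedom (24 versus 12 here). With a batch term in the model, as in my real design, the estimates themselves could also shift, so this is a simplification.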

Best regards,

Arvid