[BioC] LIMMA similarity of contrast matrix, F test and topTable

Thu Jul 3 16:15:06 CEST 2014

Hi Koen,

On 7/3/2014 9:14 AM, Koen Van den Berge wrote:
> Dear Bioconductors,
>
> I am currently analysing gene expression data using microarrays. I
> currently have a simple model which looks at plant development. The
> design matrix can be replicated as follows:
>
> devStadium <- factor(rep(c("Ovule" , "Seed24h" , "Globular" ,
> "Cotyledon" , "MGreen" , "PMGreen" , "Seedling"),each=2)) design.null
> <-model.matrix(~ devStadium ,
> contrasts.arg=list(devStadium=contr.sum))
>
> Next, I fit the linear models and construct a contrast matrix:
>
> fit.null <- lmFit(eset.par.diff,design=design.null)
>
> contrast.matrix <- makeContrasts( Intercept = (Intercept), Ovule=
> devStadium1, Seed24h= devStadium2, Globular= devStadium3, Cotyledon=
> devStadium4, MGreen= devStadium5, PMGreen= devStadium6, Seedling=
> -c(devStadium1+devStadium2+devStadium3+
> devStadium4+devStadium5+devStadium6), levels = design.null)

This isn't right. Unless you have specified a different order for your 
levels that you don't show here, you are assuming that R will use your 
factor levels in the order specified rather than alphabetical:

 > devStadium
  [1] Ovule     Ovule     Seed24h   Seed24h   Globular  Globular 
Cotyledon Cotyledon MGreen    MGreen    PMGreen   PMGreen
[13] Seedling  Seedling
Levels: Cotyledon Globular MGreen Ovule PMGreen Seed24h Seedling

Plus, if I am not mistaken, your final contrast is wrong as well. Each 
coefficient can be interpreted as the difference between the sample type 
and the grand mean. In order for the design matrix to be full rank, the 
Seedling coefficient has to be specified in terms of the other six 
coefficients. But its interpretation won't be different.

In addition, since the coefficients are themselves the differences you 
seek, you don't have to use contrasts.fit. Just

fit.null <- lmFit(eset.par.diff, design = design.null)
fit.null <- eBayes(fit.null)

Is fine.

>
> fit.null <- contrasts.fit(fit.null, contrast.matrix) fit.null <-
> eBayes(fit.null)
>
> Since I have coded the design matrix using sum coding, the intercept
> should indicate the average gene expression over all the seven
> developmental stadia. I test how many genes show significant
> expression during development through:
>
> null.ict.table <-
> topTable(fit.null,coef=1,n=nrow(fit.null),adjust.method="BH")
> mean(null.ict.table$adj.P.Val < 0.05)
>
> This output tells me that 80.5% of the genes show expression during
> plant development.

I think that statement is debatable. What you have shown is that the 
grand mean is arguably different from zero for 80.5% of the genes. You 
could have a gene with a reliable mean expression of 2^1 and that would 
be significantly different from zero. However, if all of the background 
probes have a reliable expression level of 2^4, are you willing to state 
that a gene with an expression level four-fold lower is really expressed?

Anyway, it seems to me that you are making things much harder than 
necessary. The mean expression value is actually a fairly meaningless 
term, as it is a combination of noise and signal, and it is impossible 
to tease the two apart. This is why people only use microarrays as a 
comparative tool. If you compare the expression level for sample A vs 
sample B, then the noise component of the expression value cancels out, 
more or less, and you are left with primarily signal.

I would instead tend to use the default parameterization 
(contr.treatment), which will allow you to make pretty much any 
comparison you might like. If you specify a cell means model:

model.matrix(~0+devStadium)

Then you can still compute a contrast to test for the mean expression, 
if you really care about that.

Best,

Jim

>
> Furthermore, I would like to perform an F test over all the
> developmental stadia I have sampled. In total, there are seven
> stadia. However, the application of this F test is uncertain to me.
> According to the manual, the F-statistic gained from the eBayes step
> tests whether any of the contrasts are non-zero for that gene.
> However, since I have defined an intercept in my contrasts, this
> might not be the way to go. Therefore, I thought of making a small
> contrast matrix (L) myself, using the following code:
>
> L <- matrix(0,nrow=8,ncol=1) L[2:8,1] <- 1 null.stadia.table <-
> topTable(fit.null,coef=L,n=nrow(fit.null),adjust.method="BH")
> mean(null.stadia.table$adj.P.Val < 0.05)
>
> My reasoning here is that the 1’s represent the coefficients to be
> tested (thus the seven developmental stadia). The results of this
> output show me that EXACTLY the same amount of genes turn out to be
> significant than with testing the intercept.
>
> I don’t know why this is, but it makes me believe my approach is
> wrong. If anyone can explain me why, this would lighten things up for
> me. Anyway, I proceeded using something else
>
> null.stadia.table <-
> topTable(fit.null,coef=c(2:8),n=nrow(fit.null),adjust.method="BH")
> mean(null.stadia.table$adj.P.Val < 0.05)
>
> This approach is similar to using the contrast matrix
>
> L <- matrix(0,nrow=7,ncol=7) diag(L) <- 1 rbind(rep(0,7),L)
>
> Here, I state testing the seven developmental stadium coefficients,
> and I get a lot less genes that turn out to be significant than with
> the two approaches above. My primary question aims at the differences
> between both contrast matrices I have constructed in testing while
> using topTable.
>
> Thank you in advance, Koen
> _______________________________________________ Bioconductor mailing
> list Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099