[BioC] limma with missing values

Hans-Ulrich Klein h.klein at uni-muenster.de
Sat Mar 8 19:17:28 CET 2008


Dear Gordon,

thank you very much for your reply!

Gordon K Smyth wrote:
> The output that you give does not match your code.[...]

I am sorry for that. I wrote
conts = list(treat="contr.sum", class="contr.treatment")
in my posting but actually used
conts = list(treat="contr.treatment", class="contr.treatment").

> limma treats missing values in exactly the same way as other linear 
> model function in R, such as lm(), glm() etc.  For each gene, the rows 
> with missing values are removed both from the data and design matrix, 
> then non-estimable coefficients are removed (given NA) from the end of 
> the design matrix.  See the documentation for those functions for more 
> information.

Yes. This behaviour makes sense for lm(), where only one model is fitted 
and the user knows that a coefficient can not be estimated due to NAs.

I found it confusing how limma stores and returns the results. Consider 
an even shorter version of the example in my previous mail (attached at 
the end of this mail). The design matrix is modeled such that the 
coefficient "treatT2" is the difference between T2 and T1 (T1 is the 
baseline). The column "treatT2" of Fit$coefficients stores the estimated 
difference T2-T1 for gene 1 but T2-T3 for gene 3. Calling topTable() 
with coef="treatT2" to compare T2 and T1 is misleading.

The problem is that often a user will not notice when limma drops a 
column of the design matrix for a few of many thousand genes. Maybe a 
warning message should be printed?

> The ideal solution is not to introduce missing values into your data in 
> the first place.  In my experimence, missing values are almost always 
> avoidable.  I have never seen a situation where it was necessary or 
> desirable to introduce a large proportion of missing values.

About 10%-15% of the spots are flagged by the image analysis software 
(GenePix) mainly due to saturation. Perhaps the scanner was poorly 
adjusted. Probably it is a good idea to include them in the analysis 
with low weights?

Best wishes,
Hans-Ulrich




Here is the example:

df = data.frame(g1=c(5,5,6,6,NA,NA),
                 g2=c(5,5,NA,NA,6,6),
                 g3=c(NA,NA,5,5,6,6),
                 treat=c("T1","T1","T2","T2","T3","T3"))
conts = list(treat="contr.treatment")
X = t(df[,1:3])
D = model.matrix( ~ treat, data=df, contrasts=conts)
Fit = lmFit(X, D)
Fit$coefficients
    (Intercept) treatT2 treatT3
g1           5       1      NA
g2           5      NA       1
g3           6      -1      NA



More information about the Bioconductor mailing list