[BioC] limma with missing values

Sun Mar 9 01:41:34 CET 2008

On Sat, 8 Mar 2008, Hans-Ulrich Klein wrote:

Gordon K Smyth wrote:
>> limma treats missing values in exactly the same way as other linear model 
>> function in R, such as lm(), glm() etc.  For each gene, the rows with 
>> missing values are removed both from the data and design matrix, then 
>> non-estimable coefficients are removed (given NA) from the end of the 
>> design matrix.  See the documentation for those functions for more 
>> information.
>
> Yes. This behaviour makes sense for lm(), where only one model is fitted and 
> the user knows that a coefficient can not be estimated due to NAs.

I find it hard to accept that a behaviour can be sensible for lm() but not 
lmFit(), since they're both fitting linear models.  On the other hand, the 
behaviour may not be sensible for lm() either.  See below.

> I found it confusing how limma stores and returns the results. Consider an 
> even shorter version of the example in my previous mail (attached at the end 
> of this mail). The design matrix is modeled such that the coefficient 
> "treatT2" is the difference between T2 and T1 (T1 is the baseline). The 
> column "treatT2" of Fit$coefficients stores the estimated difference T2-T1 
> for gene 1 but T2-T3 for gene 3. Calling topTable() with coef="treatT2" to 
> compare T2 and T1 is misleading.

I understand your point.  However it is hard to think of another strategy 
for accommodating missing values in lmFit() that would be universally 
applicable and interprettable.  Can you suggest one?

Note that lm() has the same problem.  Using lm() you could fit the model 
for gene3 either by

   > lm(X[3,]~0+D)$coef
   D(Intercept)     DtreatT2     DtreatT3
              6           -1           NA

or by

   > lm(X[3,]~df$treat)$coef
   (Intercept)  df$treatT3
             5           1

and both are wrong, in the sense that the T2 and T3 coefficients cannot be 
interpretted as the T2-T1 and T3-T1 contrasts.

Note that you can use limma as it is now to get exactly the correct answer 
in all cases.  You could proceed by

   design <- model.matrix(~0+df$treat)
   fit <- lmFit(X,design)

Now if you want to compare T2 to T1, you could remove T3 and compute the 
contast by

   > fit2 <- contrasts.fit(fit[,-3],c(-1,1))
   > fit2$coef
      [,1]
   g1    1
   g2   NA
   g3   NA

which is exactly correct.  Similarly to compare T3 to T1

> fit3 <- contrasts.fit(fit[,-2],c(-1,1))
> fit3$coef
    [,1]
g1   NA
g2    1
g3   NA

> The problem is that often a user will not notice when limma drops a column of 
> the design matrix for a few of many thousand genes. Maybe a warning message 
> should be printed?

Maybe.  On one hand, I'm reluctant to introduce a warning a message in a 
situation in which lm() does not.  One the other hand, I agree with you 
that linear model formulae and missing values don't mix when one of the 
coefficients becomes inestimable.  The only way to get the correct answer 
is to use the group-mean parametrisation and form contrasts explicitly.

>> The ideal solution is not to introduce missing values into your data in the 
>> first place.  In my experimence, missing values are almost always 
>> avoidable.  I have never seen a situation where it was necessary or 
>> desirable to introduce a large proportion of missing values.
>
> About 10%-15% of the spots are flagged by the image analysis software 
> (GenePix) mainly due to saturation. Perhaps the scanner was poorly adjusted. 
> Probably it is a good idea to include them in the analysis with low weights?

Yes, definitely, it would be better to include them, whether or not you 
use low weights.  Marking these spots as NA is incorrect, because linear 
model functions such as lmFit() have to assume that they are 
missing-at-random, which they are not.  They are actually the largest 
intensities in the dataset.

Similar remarks apply to other spots flagged by GenePix because they are 
faint.  Again it would be incorrect to convert them to NA, or to give them 
zero weight.

Best wishes
Gordon

> Best wishes,
> Hans-Ulrich
>
>
>
>
> Here is the example:
>
> df = data.frame(g1=c(5,5,6,6,NA,NA),
>                g2=c(5,5,NA,NA,6,6),
>                g3=c(NA,NA,5,5,6,6),
>                treat=c("T1","T1","T2","T2","T3","T3"))
> conts = list(treat="contr.treatment")
> X = t(df[,1:3])
> D = model.matrix( ~ treat, data=df, contrasts=conts)
> Fit = lmFit(X, D)
> Fit$coefficients
>   (Intercept) treatT2 treatT3
> g1           5       1      NA
> g2           5      NA       1
> g3           6      -1      NA