[BioC] limma with missing values

Gordon K Smyth smyth at wehi.EDU.AU
Sun Mar 9 01:41:34 CET 2008

On Sat, 8 Mar 2008, Hans-Ulrich Klein wrote:

Gordon K Smyth wrote:
>> limma treats missing values in exactly the same way as other linear model
>> function in R, such as lm(), glm() etc.  For each gene, the rows with
>> missing values are removed both from the data and design matrix, then
>> non-estimable coefficients are removed (given NA) from the end of the
>> design matrix.  See the documentation for those functions for more
>> information.
>
> Yes. This behaviour makes sense for lm(), where only one model is fitted and
> the user knows that a coefficient can not be estimated due to NAs.

I find it hard to accept that a behaviour can be sensible for lm() but not
lmFit(), since they're both fitting linear models.  On the other hand, the
behaviour may not be sensible for lm() either.  See below.

> I found it confusing how limma stores and returns the results. Consider an
> even shorter version of the example in my previous mail (attached at the end
> of this mail). The design matrix is modeled such that the coefficient
> "treatT2" is the difference between T2 and T1 (T1 is the baseline). The
> column "treatT2" of Fit\$coefficients stores the estimated difference T2-T1
> for gene 1 but T2-T3 for gene 3. Calling topTable() with coef="treatT2" to
> compare T2 and T1 is misleading.

I understand your point.  However it is hard to think of another strategy
for accommodating missing values in lmFit() that would be universally
applicable and interprettable.  Can you suggest one?

Note that lm() has the same problem.  Using lm() you could fit the model
for gene3 either by

> lm(X[3,]~0+D)\$coef
D(Intercept)     DtreatT2     DtreatT3
6           -1           NA

or by

> lm(X[3,]~df\$treat)\$coef
(Intercept)  df\$treatT3
5           1

and both are wrong, in the sense that the T2 and T3 coefficients cannot be
interpretted as the T2-T1 and T3-T1 contrasts.

Note that you can use limma as it is now to get exactly the correct answer
in all cases.  You could proceed by

design <- model.matrix(~0+df\$treat)
fit <- lmFit(X,design)

Now if you want to compare T2 to T1, you could remove T3 and compute the
contast by

> fit2 <- contrasts.fit(fit[,-3],c(-1,1))
> fit2\$coef
[,1]
g1    1
g2   NA
g3   NA

which is exactly correct.  Similarly to compare T3 to T1

> fit3 <- contrasts.fit(fit[,-2],c(-1,1))
> fit3\$coef
[,1]
g1   NA
g2    1
g3   NA

> The problem is that often a user will not notice when limma drops a column of
> the design matrix for a few of many thousand genes. Maybe a warning message
> should be printed?

Maybe.  On one hand, I'm reluctant to introduce a warning a message in a
situation in which lm() does not.  One the other hand, I agree with you
that linear model formulae and missing values don't mix when one of the
coefficients becomes inestimable.  The only way to get the correct answer
is to use the group-mean parametrisation and form contrasts explicitly.

>> The ideal solution is not to introduce missing values into your data in the
>> first place.  In my experimence, missing values are almost always
>> avoidable.  I have never seen a situation where it was necessary or
>> desirable to introduce a large proportion of missing values.
>
> About 10%-15% of the spots are flagged by the image analysis software
> (GenePix) mainly due to saturation. Perhaps the scanner was poorly adjusted.
> Probably it is a good idea to include them in the analysis with low weights?

Yes, definitely, it would be better to include them, whether or not you
use low weights.  Marking these spots as NA is incorrect, because linear
model functions such as lmFit() have to assume that they are
missing-at-random, which they are not.  They are actually the largest
intensities in the dataset.

Similar remarks apply to other spots flagged by GenePix because they are
faint.  Again it would be incorrect to convert them to NA, or to give them
zero weight.

Best wishes
Gordon

> Best wishes,
> Hans-Ulrich
>
>
>
>
> Here is the example:
>
> df = data.frame(g1=c(5,5,6,6,NA,NA),
>                g2=c(5,5,NA,NA,6,6),
>                g3=c(NA,NA,5,5,6,6),
>                treat=c("T1","T1","T2","T2","T3","T3"))
> conts = list(treat="contr.treatment")
> X = t(df[,1:3])
> D = model.matrix( ~ treat, data=df, contrasts=conts)
> Fit = lmFit(X, D)
> Fit\$coefficients
>   (Intercept) treatT2 treatT3
> g1           5       1      NA
> g2           5      NA       1
> g3           6      -1      NA