[BioC] missing values in limma/contrasts.fit

Mon Dec 14 20:15:05 CET 2009

Dear BioConductor Folk

The help file for contrasts.fit states:

     "Warning. For efficiency reasons, this function does not
     re-factorize the design matrix for each probe. A consequence is
     that, if the design matrix is non-orthogonal and the original fit
     included quality weights or missing values, then the unscaled
     standard deviations produced by this function are approximate
     rather than exact. The approximation is usually acceptable...."
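
To spell out what the warning refers to, here is a short derivation. The
notation is assumed here, not taken from the help file: for probe g, X_g is
the design matrix restricted to the arrays with observed values, W_g is the
diagonal matrix of quality weights, and c is a contrast vector. The first
line is the standard weighted least squares result. The second line is only
one reading of the approximation the warning describes, not a quotation of
the limma source: D_g is the diagonal matrix of per-probe unscaled standard
deviations of the coefficients, and R is the coefficient correlation matrix
computed once from the full design.

```latex
\begin{align*}
u_g^{\mathrm{exact}}(c)  &= \sqrt{c^{\top} \left( X_g^{\top} W_g X_g \right)^{-1} c}, \\
u_g^{\mathrm{approx}}(c) &= \sqrt{c^{\top} D_g \, R \, D_g \, c}.
\end{align*}
```

Under that reading, the two coincide whenever the correlation matrix of
(X_g' W_g X_g)^{-1} equals R, as it does for a probe with no missing values
and equal weights. Otherwise they can differ.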

My attention was attracted to the statement when a colleague in
biology asked me why one would get different sets of probes identified
as differentially expressed, depending on which individual or
biological sample was selected as the reference in a balanced loop
design.  

My experience, admittedly limited, suggests that the computational
efficiency gain is not worth the loss of accuracy.  Even if one has to
sacrifice the efficiency of a single pass through the raw data, at
least one gets correct results.  I have hacked a version of lmFit to
evaluate contrasts with standard errors based on the exact covariance
matrix.  It runs essentially as quickly as lmFit, so I find the
efficiency argument uncompelling.
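
The comparison can be illustrated with a toy example. What follows is a
minimal sketch, not the hacked lmFit itself: it is written in Python rather
than R, every function name is hypothetical, and the three-array loop design
(B vs A, C vs B, A vs C) is invented for illustration. The "approximate"
calculation is an assumption based on one reading of the help-file warning:
per-probe coefficient standard deviations combined with the correlation
matrix from the full design.

```python
from math import sqrt


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def unscaled_cov(design, observed):
    """(X'X)^-1 using only the arrays observed for this probe."""
    rows = [r for r, keep in zip(design, observed) if keep]
    return inverse(matmul(transpose(rows), rows))


def quad_form(c, m):
    k = len(c)
    return sum(c[i] * m[i][j] * c[j] for i in range(k) for j in range(k))


def exact_sd(design, observed, contrast):
    """Unscaled SD of the contrast from the probe's own covariance matrix."""
    return sqrt(quad_form(contrast, unscaled_cov(design, observed)))


def approx_sd(design, observed, contrast):
    """ASSUMED approximation: per-probe coefficient SDs, full-design correlation."""
    full = unscaled_cov(design, [True] * len(design))
    probe = unscaled_cov(design, observed)
    k = len(contrast)
    cov = [[full[i][j] / sqrt(full[i][i] * full[j][j])
            * sqrt(probe[i][i] * probe[j][j])
            for j in range(k)] for i in range(k)]
    return sqrt(quad_form(contrast, cov))


# Same three arrays, two parametrizations of the same loop design.
design_ref_a = [[1, 0], [-1, 1], [0, -1]]   # coefficients: B-A, C-A
design_ref_b = [[-1, 0], [0, 1], [1, -1]]   # coefficients: A-B, C-B
observed = [True, False, True]              # array 2 is NA for this probe

# The contrast C-B expressed in each parametrization.
exact_a = exact_sd(design_ref_a, observed, [-1, 1])
approx_a = approx_sd(design_ref_a, observed, [-1, 1])
exact_b = exact_sd(design_ref_b, observed, [0, 1])
approx_b = approx_sd(design_ref_b, observed, [0, 1])

print("reference A: exact", exact_a, "approx", approx_a)
print("reference B: exact", exact_b, "approx", approx_b)
```

In this toy case the exact value for the C-B contrast is the same under
either reference, while the assumed approximation changes with the choice
of reference. That is the kind of reference dependence described above.
The example is too small to leave residual degrees of freedom, so it
speaks only to the unscaled standard deviations.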

A search of the archive produced several discussions of missing values
in limma.  The main argument I see is Gordon Smyth's (Date: 2008-03-08):

   "The ideal solution is not to introduce missing values into your
    data in the first place.  In my experience, missing values are
    almost always avoidable.  I have never seen a situation where it
    was necessary or desirable to introduce a large proportion of
    missing values."

My colleagues in biology report that they inspect their arrays
visually and note probes which have been scratched, probes covered by
background blobs and the like.  These categories seem to satisfy the
missing-at-random criterion: the probe is marked NA not because it is
saturated or below background, but because it was unreadable for
reasons unrelated to the response.

I'd appreciate feedback: has anyone else already implemented exact
standard errors for contrasts?  Would others find this useful?  Are
there objections I have overlooked?

albyn