[BioC] missing values in limma/contrasts.fit

Wed Dec 16 00:09:59 CET 2009

Dear Albyn,

It is always helpful to give a code example, preferably something which 
can be run by others, but at least showing us the details of the 
situation.  There isn't any way to confirm from your email that the issue 
you mention is the root cause of the problem.  Your colleague's data 
example would have to quite unusual for this to be true.

I have plenty of sympathy with researchers carefully examining the 
microarray image and marking scratches or other imperfections, but it is 
hard to imagine that they would flag tens of thousands of spots in this 
way and, if they did, it might be better to drop the array entirely.

lmFit() already computes exact standard errors.

In many large data problems, in which many contrasts might be experimented 
with, it is convenient to be able to compute a basic fit object and then 
to apply various contrasts posthoc.  This was especially true in the early 
days of microarray analysis with somewhat slower computers than today. 
This is the reason for the separate contrasts.fit() function.

I made an early design decision that fit objects could be order 
ngenes*nparameters in size, but not order ngenes*nparameters^2, so it was 
not possible to store genewise QR decompositions.  This leads to the 
computational efficiency issue, which applies to contrasts.fit, not to 
lmFit.

If a dataset has so many spot-specific weights of different sizes, so that 
the approximation involved in contrasts.fit() is not trustworthy, then it 
is simple enough to run lmFit() a couple of times with different 
parametrizations to get all the contrasts desired.

I have considered in the past adding a 'contrasts' argument to lmFit, so 
that users could fit the basic regression and simultaneously extract all 
desired contrasts, using the exact covariance matrices. (This is not quite 
as simple as you might think, because lmFit supports a number of different 
covariance structures, robust fits etc, so the changes have to be made in 
a number of low level functions.  And it changes the structure of the fit 
object which is passed on.)  I can still do so if there is a demand for 
it.

Best wishes
Gordon

> Date: Mon, 14 Dec 2009 11:15:05 -0800
> From: Albyn Jones <jones at reed.edu>
> Subject: [BioC] missing values in limma/contrasts.fit
> To: bioconductor at stat.math.ethz.ch
> Message-ID: <20091214191505.GD11908 at laplace.reed.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Dear BioConductor Folk
>
> The help file for contrasts.fit states:
>
>     "Warning. For efficiency reasons, this function does not
>     re-factorize the design matrix for each probe. A consequence is
>     that, if the design matrix is non-orthogonal and the original fit
>     included quality weights or missing values, then the unscaled
>     standard deviations produced by this function are approximate
>     rather than exact. The approximation is usually acceptable...."
>
> My attention was attracted to the statement when a colleague in
> biology asked me why one would get different sets of probes identified
> as differentially expressed, depending on which individual or
> biological sample was selected as the reference in a balanced loop
> design.
>
> My experience, admittedly limited, suggests that the computational
> efficiency gain is not worth the loss of accuracy.  Even if one has to
> sacrifice the efficiency of a single pass through the raw data, at
> least one gets correct results.  I have hacked a version of lmFit to
> evaluate contrasts with standard errors based on the exact covariance
> matrix.  It runs esssentially as quickly as lmFit, so I find the
> efficiency argument uncompelling.
>
> A search of the archive produced several discussions of missing values
> in limma.  The main argument I see is Gordon Smyth's (Date: 2008-03-08)
>
>   "The ideal solution is not to introduce missing values into your
>    data in the first place.  In my experimence, missing values are
>    almost always avoidable.  I have never seen a situation where it
>    was necessary or desirable to introduce a large proportion of
>    missing values."
>
> My colleagues in biology report that they inspect their arrays
> visually and note probes which have been scratched, probes covered by
> background blobs and the like.  These categories seem to satisfy the
> missing-at-random criterion: the probe is marked NA not because it is
> saturated or below background, but because it was unreadable for
> reasons unrelated to the response.
>
> I'd appreciate feedback: has anyone else already done this? Would
> others find this useful?  Are there objections I have overlooked?
>
> albyn

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}