[BioC] missing values in limma/contrasts.fit

Thu Dec 17 00:13:29 CET 2009

It might be useful to see the code pipeline that your biological 
colleagues used.

Best wishes
Gordon

On Wed, 16 Dec 2009, Albyn Jones wrote:

> Gordon
>
> Thanks for the toughtful response.  I should have included an example;
> I will put a small one together soon.  I am inmeshed in giving and grading
> exams at the moment, so it won't happen immediately.
>
> albyn
>
> On Wed, Dec 16, 2009 at 10:09:59AM +1100, Gordon K Smyth wrote:
>> Dear Albyn,
>>
>> It is always helpful to give a code example, preferably something which can
>> be run by others, but at least showing us the details of the situation.
>> There isn't any way to confirm from your email that the issue you mention
>> is the root cause of the problem.  Your colleague's data example would have
>> to quite unusual for this to be true.
>>
>> I have plenty of sympathy with researchers carefully examining the
>> microarray image and marking scratches or other imperfections, but it is
>> hard to imagine that they would flag tens of thousands of spots in this way
>> and, if they did, it might be better to drop the array entirely.
>>
>> lmFit() already computes exact standard errors.
>>
>> In many large data problems, in which many contrasts might be experimented
>> with, it is convenient to be able to compute a basic fit object and then to
>> apply various contrasts posthoc.  This was especially true in the early
>> days of microarray analysis with somewhat slower computers than today. This
>> is the reason for the separate contrasts.fit() function.
>>
>> I made an early design decision that fit objects could be order
>> ngenes*nparameters in size, but not order ngenes*nparameters^2, so it was
>> not possible to store genewise QR decompositions.  This leads to the
>> computational efficiency issue, which applies to contrasts.fit, not to
>> lmFit.
>>
>> If a dataset has so many spot-specific weights of different sizes, so that
>> the approximation involved in contrasts.fit() is not trustworthy, then it
>> is simple enough to run lmFit() a couple of times with different
>> parametrizations to get all the contrasts desired.
>>
>> I have considered in the past adding a 'contrasts' argument to lmFit, so
>> that users could fit the basic regression and simultaneously extract all
>> desired contrasts, using the exact covariance matrices. (This is not quite
>> as simple as you might think, because lmFit supports a number of different
>> covariance structures, robust fits etc, so the changes have to be made in a
>> number of low level functions.  And it changes the structure of the fit
>> object which is passed on.)  I can still do so if there is a demand for it.
>>
>> Best wishes
>> Gordon
>>
>>> Date: Mon, 14 Dec 2009 11:15:05 -0800
>>> From: Albyn Jones <jones at reed.edu>
>>> Subject: [BioC] missing values in limma/contrasts.fit
>>> To: bioconductor at stat.math.ethz.ch
>>> Message-ID: <20091214191505.GD11908 at laplace.reed.edu>
>>> Content-Type: text/plain; charset=us-ascii
>>>
>>> Dear BioConductor Folk
>>>
>>> The help file for contrasts.fit states:
>>>
>>>     "Warning. For efficiency reasons, this function does not
>>>     re-factorize the design matrix for each probe. A consequence is
>>>     that, if the design matrix is non-orthogonal and the original fit
>>>     included quality weights or missing values, then the unscaled
>>>     standard deviations produced by this function are approximate
>>>     rather than exact. The approximation is usually acceptable...."
>>>
>>> My attention was attracted to the statement when a colleague in
>>> biology asked me why one would get different sets of probes identified
>>> as differentially expressed, depending on which individual or
>>> biological sample was selected as the reference in a balanced loop
>>> design.
>>>
>>> My experience, admittedly limited, suggests that the computational
>>> efficiency gain is not worth the loss of accuracy.  Even if one has to
>>> sacrifice the efficiency of a single pass through the raw data, at
>>> least one gets correct results.  I have hacked a version of lmFit to
>>> evaluate contrasts with standard errors based on the exact covariance
>>> matrix.  It runs esssentially as quickly as lmFit, so I find the
>>> efficiency argument uncompelling.
>>>
>>> A search of the archive produced several discussions of missing values
>>> in limma.  The main argument I see is Gordon Smyth's (Date: 2008-03-08)
>>>
>>>   "The ideal solution is not to introduce missing values into your
>>>    data in the first place.  In my experimence, missing values are
>>>    almost always avoidable.  I have never seen a situation where it
>>>    was necessary or desirable to introduce a large proportion of
>>>    missing values."
>>>
>>> My colleagues in biology report that they inspect their arrays
>>> visually and note probes which have been scratched, probes covered by
>>> background blobs and the like.  These categories seem to satisfy the
>>> missing-at-random criterion: the probe is marked NA not because it is
>>> saturated or below background, but because it was unreadable for
>>> reasons unrelated to the response.
>>>
>>> I'd appreciate feedback: has anyone else already done this? Would
>>> others find this useful?  Are there objections I have overlooked?
>>>
>>> albyn
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the addressee.
>> You must not disclose, forward, print or use it without the permission of the sender.
>> ______________________________________________________________________
>>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}