[BioC] missing values in limma/contrasts.fit
Gordon K Smyth
smyth at wehi.EDU.AU
Wed Dec 16 00:09:59 CET 2009
Dear Albyn,
It is always helpful to give a code example, preferably something which
can be run by others, but at least showing us the details of the
situation. There isn't any way to confirm from your email that the issue
you mention is the root cause of the problem. Your colleague's data
example would have to quite unusual for this to be true.
I have plenty of sympathy with researchers carefully examining the
microarray image and marking scratches or other imperfections, but it is
hard to imagine that they would flag tens of thousands of spots in this
way and, if they did, it might be better to drop the array entirely.
lmFit() already computes exact standard errors.
In many large data problems, in which many contrasts might be experimented
with, it is convenient to be able to compute a basic fit object and then
to apply various contrasts posthoc. This was especially true in the early
days of microarray analysis with somewhat slower computers than today.
This is the reason for the separate contrasts.fit() function.
I made an early design decision that fit objects could be order
ngenes*nparameters in size, but not order ngenes*nparameters^2, so it was
not possible to store genewise QR decompositions. This leads to the
computational efficiency issue, which applies to contrasts.fit, not to
lmFit.
If a dataset has so many spot-specific weights of different sizes, so that
the approximation involved in contrasts.fit() is not trustworthy, then it
is simple enough to run lmFit() a couple of times with different
parametrizations to get all the contrasts desired.
I have considered in the past adding a 'contrasts' argument to lmFit, so
that users could fit the basic regression and simultaneously extract all
desired contrasts, using the exact covariance matrices. (This is not quite
as simple as you might think, because lmFit supports a number of different
covariance structures, robust fits etc, so the changes have to be made in
a number of low level functions. And it changes the structure of the fit
object which is passed on.) I can still do so if there is a demand for
it.
Best wishes
Gordon
> Date: Mon, 14 Dec 2009 11:15:05 -0800
> From: Albyn Jones <jones at reed.edu>
> Subject: [BioC] missing values in limma/contrasts.fit
> To: bioconductor at stat.math.ethz.ch
> Message-ID: <20091214191505.GD11908 at laplace.reed.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Dear BioConductor Folk
>
> The help file for contrasts.fit states:
>
> "Warning. For efficiency reasons, this function does not
> re-factorize the design matrix for each probe. A consequence is
> that, if the design matrix is non-orthogonal and the original fit
> included quality weights or missing values, then the unscaled
> standard deviations produced by this function are approximate
> rather than exact. The approximation is usually acceptable...."
>
> My attention was attracted to the statement when a colleague in
> biology asked me why one would get different sets of probes identified
> as differentially expressed, depending on which individual or
> biological sample was selected as the reference in a balanced loop
> design.
>
> My experience, admittedly limited, suggests that the computational
> efficiency gain is not worth the loss of accuracy. Even if one has to
> sacrifice the efficiency of a single pass through the raw data, at
> least one gets correct results. I have hacked a version of lmFit to
> evaluate contrasts with standard errors based on the exact covariance
> matrix. It runs esssentially as quickly as lmFit, so I find the
> efficiency argument uncompelling.
>
> A search of the archive produced several discussions of missing values
> in limma. The main argument I see is Gordon Smyth's (Date: 2008-03-08)
>
> "The ideal solution is not to introduce missing values into your
> data in the first place. In my experimence, missing values are
> almost always avoidable. I have never seen a situation where it
> was necessary or desirable to introduce a large proportion of
> missing values."
>
> My colleagues in biology report that they inspect their arrays
> visually and note probes which have been scratched, probes covered by
> background blobs and the like. These categories seem to satisfy the
> missing-at-random criterion: the probe is marked NA not because it is
> saturated or below background, but because it was unreadable for
> reasons unrelated to the response.
>
> I'd appreciate feedback: has anyone else already done this? Would
> others find this useful? Are there objections I have overlooked?
>
> albyn
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list