[BioC] missing value handling in limma
smyth at wehi.edu.au
Tue Jun 8 00:46:00 CEST 2004
At 04:53 AM 8/06/2004, xiaocui zhu wrote:
>I recently used the linear model fit in limma to rank differentially
>expressed genes between treated vs. control with a data set. The data
>includes three log2(Treated/Control) replicate sets, and a dyeSwap for
>each replicate. So the design matrix is c(1,-1,1,-1,1,-1). Among the
>top rank genes, I noticed some of them have only one log2Ratio
>measurement with the rest being "NA". I set the log2Ratio of a gene to
>"NA", if its green or red intensity measurement is below background,
>saturated, low intensity, or non-uniform. I am wondering how the linear
>model in limma handles missing values and why a gene with only one data
>point is identified as a high ranking differentially expressed gene.
It is perfectly possible, although very unlikely, for a gene with only one
non-missing value to be top-ranked. It would have to have an
extraordinarily large fold change for this to happen.
limma handles missing values in the usual way for linear models at the
lmFit() step: each gene is fitted using only its non-missing values. A gene
with only one value will therefore get df.residual=0. At the
shrinkage step, the residual standard deviation for such a gene will be
reset to the consensus value across all genes, and the corresponding
degrees of freedom will be df.prior. This is explained in the article
Smyth, SAGMB, 2004, cited in the documentation.
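The behaviour described above can be sketched in base R for the single-coefficient dye-swap design. This is an illustration only, not limma's own code: the function name moderated_t and the values s2_prior and df_prior are hypothetical (in limma the prior values are estimated from all genes at the shrinkage step), and the log-ratios are made up.

```r
# Dye-swap design from the question: three replicates, each with a swap
design <- c(1, -1, 1, -1, 1, -1)

# Hypothetical helper: moderated t-statistic for one gene, one coefficient.
# s2_prior and df_prior are assumed to be already known.
moderated_t <- function(y, design, s2_prior, df_prior) {
  ok <- !is.na(y)                    # drop missing log-ratios
  x <- design[ok]
  y <- y[ok]
  beta <- sum(x * y) / sum(x^2)      # least-squares estimate of the log-ratio
  df_residual <- length(y) - 1       # one value gives 0 residual df
  rss <- sum((y - x * beta)^2)       # equals df_residual * residual variance
  # Posterior variance: weighted average of prior and gene-wise variance.
  # With df_residual = 0 this reduces to s2_prior.
  s2_post <- (df_prior * s2_prior + rss) / (df_prior + df_residual)
  c(coef        = beta,
    df.residual = df_residual,
    s2.post     = s2_post,
    df.total    = df_prior + df_residual,
    t           = beta / sqrt(s2_post / sum(x^2)))
}

# Made-up data: one gene with a single measurement, one with all six
one_value <- c(3.0, NA, NA, NA, NA, NA)
all_six   <- c(1.0, -1.2, 0.9, -1.1, 1.1, -0.9)

res_one <- moderated_t(one_value, design, s2_prior = 0.04, df_prior = 4)
res_six <- moderated_t(all_six,   design, s2_prior = 0.04, df_prior = 4)
print(rbind(one_value = res_one, all_six = res_six))
```

For the single-value gene the posterior variance is just the assumed prior variance and the total degrees of freedom equal df_prior, so its t-statistic is driven entirely by the size of its one log-ratio. That is why only an extraordinarily large fold change can put such a gene near the top.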
>Thank you for your help in advance!