[BioC] limma, contrasts, and NA's

Gordon K Smyth smyth at wehi.EDU.AU
Tue Aug 17 06:25:58 CEST 2010


Dear Albyn,

Many thanks for the very nice reproducible and complete data example.

I've always agreed that spot-specific NA's do have an effect on the 
results from contrasts.fit(), and the results can change a bit with choice 
of reference.  I will try to add some extra functionality to lmFit() so 
that users can extract contrasts at a stage when the genewise QR 
decompositions are still available, so all calculations can be done 
exactly.  Thanks for sending me your version of code to do that.

Let me come back though to the theme of whether there should be so many 
NAs in the data in the first place, and this is partly why I asked you for 
a reproducible example.  In my experience, if there are enough NAs to 
cause problems for contrasts.fit(), then it is usually a symptom that too 
much spot flagging is being done.  I hope you won't mind if I make some 
comments on the filtering in your example.  I suspect that the filtering 
was probably imposed on you by your collaborators anyway.

In your data example, nearly 40% of all spots on all arrays are being 
marked as NA.  In my opinion, this is seriously over the top, unless the 
arrays are so bad as to be useless.  I am very strongly opposed to the 
notion of filtering faint spots.  Why remove perfectly good data just 
because it is a low value rather than a high value?  I also don't have 
much confidence in GenePix flags.  They are mainly a defense against 
things that we now know how to control.  In my experience, biologists 
overestimate their ability to flag bad spots.  On the other hand, 
filtering empty or non-expressed probes is a good idea, but whole probes 
should be removed, not isolated spots on individual arrays.  Finally, 
filtering should be after the normalization rather than before.

I would probably pre-process your data like this:

library(limma)
url <- "http://people.reed.edu/~renns/jones_renn/maternal_data"
targets <- readTargets("targets_fixed.csv",path=url,sep=",")
RG <- read.maimages(targets,path=url,source="genepix.median")
RGb <- 
backgroundCorrect(RG,method="normexp",normexp.method="mle",offset=50)
MA<- normalizeWithinArrays(RGb)
Empty <- MA$genes$ID %in% c("empty","Empty")
fit <- lmFit(MA[!Empty,], design)

etc.  Then there would be no missing values, and the whole issue of 
approximate calculation wouldn't arise.

Best wishes
Gordon

--------- original message ------------
[BioC] limma, contrasts, and NA's
Albyn Jones jones at reed.edu
Fri Aug 13 20:42:14 CEST 2010

Following up on a thread I initiated last year on missing values
in limma and contrasts.fit
      (see
       gmane.science.biology.informatics.conductor:26494
       gmane.science.biology.informatics.conductor:26483)
I have prepared an example illustrating the fact that when there are
missing data, the contrast.fit function not only doesn't give exact
results, but the results depend on the source chosen as the
"reference" in modelMatrix().  When I pointed out to my colleagues in
biology that they could manually construct a design matrix that would
fit contrasts exactly, they rebelled at the need to construct N-1 linearly
indpendent columns for N sources.  The utility of contrasts.fit() is
clear, avoiding the need for non-statisticians to learn to code design
matrices.

albyn
-- 
Albyn Jones
Reed College
jones at reed.edu

=========================================================================
library(limma)
### necessary files and data are at:
url <- "http://people.reed.edu/~renns/jones_renn/maternal_data/"

targets <- readTargets(paste(url, "targets_fixed.csv", sep = ""),  sep = 
",")

RG <- read.maimages(paste(url, targets$FileName, sep = ""), columns =
     list(R="F635 Median", G="F532 Median", Rb="B635 Median", Gb="B532 
Median"),
     other.columns = c("Flags","B635 SD","B532 SD","F635 % Sat.","F532 % 
Sat."))
names(RG$other) <- c("Flags", "RbSD", "GbSD","RSat" ,"GSat" )

RG$genes <- readGAL(paste(url,"Fishchip4.03_annotatedGAL_20090730HEM.gal",
             sep = ""))

RG$printer <- getLayout(RG$genes)

### Filter features manually flagged bad and automatedly flagged "not 
found"
RG$R[RG$other$Flags < 0]<-NA
RG$G[RG$other$Flags < 0]<-NA

as.numeric(apply(is.na(RG$R),2,sum))

### Filter faint features
RG$R[(RG$R < (RG$Rb + 2*RG$other$RbSD))&(RG$G < (RG$Gb + 
2*RG$other$GbSD))]<-0
RG$G[RG$R == 0]<-NA
RG$R[RG$R == 0]<-NA

as.numeric(apply(is.na(RG$R),2,sum))
table(apply(is.na(RG$R),1,sum))

### Normalize
MA<- normalizeWithinArrays(RG, method = "printtiploess", bc.method = 
"minimum")

### set up the design matrices
design1<- modelMatrix(targets, ref = "a1126")
design2<- modelMatrix(targets, ref = "a1217")

### contrast coding
cont1 <-  makeContrasts(WvsL = ( - a1111 - a1210 + a0201 + a1217 +
     a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6, 
levels=design1)
cont2 <- makeContrasts(WvsL = (a1126 - a1111 - a1210 + a0201  +
     a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6, 
levels=design2)

### analysis
fit1 <- lmFit (MA, design1)
fit.cont1 <-  contrasts.fit(fit1, cont1)
fit.ebayes1 <- eBayes(fit.cont1)

fit2 <- lmFit (MA, design2)
fit.cont2 <-  contrasts.fit(fit2, cont2)
fit.ebayes2 <- eBayes(fit.cont2)

results1 <- decideTests(fit.ebayes1, method = "separate",
             adjust.method="none", p.value=0.05)
results2 <- decideTests(fit.ebayes2, method = "separate",
             adjust.method="none", p.value=0.05)

res<-cbind(results1[,1], results2[,1])
colnames(res)<-c("ref1", "ref2")
vennCounts(res, include="both")
#     ref1 ref2 Counts
#[1,]    0    0  11862
#[2,]    0    1     30
#[3,]    1    0    100
#[4,]    1    1    814


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list