[BioC] Almost inexisting overlap of diff. expr. genes found when comparing mas5 / rma

Sat Jul 9 10:09:02 CEST 2005

Yes, we often see poor overlaps. A 40-50% overlap is considered
pretty good but is rare unless you are considering only the top 5 genes
in both lists or something silly like that.

To make a fair comparison, try comparing the lists when they are
both filtered by the same p-value or test-statistic cutoff rather than
by an arbitrarily chosen number of genes.
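
As a rough illustration of that comparison, here is a minimal sketch in
base R. The p-values and gene names are made up; in practice they might
come from the adj.P.Val column of topTable() called on each fit with all
genes returned, but that wiring is an assumption and is not shown here.

```r
# Hypothetical adjusted p-values, named by gene ID (made-up numbers).
p.rma <- c(g1 = 0.001, g2 = 0.020, g3 = 0.300, g4 = 0.040, g5 = 0.700)
p.mas <- c(g1 = 0.004, g2 = 0.600, g3 = 0.030, g4 = 0.010, g5 = 0.900)

cutoff <- 0.05                            # same cutoff for both lists

sig.rma <- names(p.rma)[p.rma < cutoff]   # genes passing under rma
sig.mas <- names(p.mas)[p.mas < cutoff]   # genes passing under mas5

common <- intersect(sig.rma, sig.mas)     # genes found by both methods

# Overlap relative to the smaller list and relative to the union
overlap.min   <- length(common) / min(length(sig.rma), length(sig.mas))
overlap.union <- length(common) / length(union(sig.rma, sig.mas))
```

Reporting the overlap against both the smaller list and the union makes
it harder for unequal list sizes to flatter or penalise the result.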

Further, two minor cosmetic points about your code:

1) If you look at your design matrix from 

 strain = c("WT","WT","WT","Drug","Drug","Drug")
 design = model.matrix(~factor(strain))
 colnames(design) = c("WT","Drug")
 design
  WT Drug
1  1    1
2  1    1
3  1    1
4  1    0
5  1    0
6  1    0

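the first column represents an intercept, not WT. Because factor()
sorts its levels alphabetically, Drug is the reference level: the
intercept is the Drug mean and the second column is the WT minus Drug
difference. To get one column per group, change the second line to

 design = model.matrix(~ -1 + factor(strain) )

and label the columns in level order, i.e. c("Drug","WT") rather than
c("WT","Drug").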
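
The two parameterisations can be compared side by side with a small
base R sketch. The variable names f, design.int and design.grp are
introduced here only for illustration.

```r
strain <- c("WT", "WT", "WT", "Drug", "Drug", "Drug")
f <- factor(strain)
levels(f)                       # "Drug" "WT" (alphabetical order)

# With an intercept: column 1 is the Drug mean (reference level),
# column 2 is the WT minus Drug difference.
design.int <- model.matrix(~ f)
colnames(design.int)            # "(Intercept)" "fWT"

# Without an intercept: one indicator column per group,
# in the same order as the factor levels.
design.grp <- model.matrix(~ -1 + f)
colnames(design.grp) <- levels(f)
design.grp
```

Taking the column names from levels(f) avoids mislabelling the groups.
Note that with the group-means design the WT versus Drug comparison is
no longer a single coefficient; in limma it would typically be set up
as a contrast, for example with makeContrasts() and contrasts.fit().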

2) You do not need to force the rownames to numeric using
as.numeric(), since intersect() happily works with characters.

 x <- c("a", "b", "c")
 y <- c("b", "c", "d")
 intersect(x,y)
[1] "b" "c"

But I do not think either of these points changes your results.

On Fri, 2005-07-08 at 18:18 +0100, Emmanuel Levy wrote:
> Dear Bioconductor community,
> 
> I've been looking for differentially expressed genes in C. elegans after a 
> drug treatment.
> There are 3 replicates of each condition and 2 conditions in total (WT and 
> Drug)
> I used limma combined with either rma or mas5. I find a very very poor 
> overlap in the results:
> 
> - example (i) only 24 of the 100 most differentially expressed genes 
> obtained using rma are found in
> the 1000 most differentially expressed genes obtained using mas5
> - example (ii) only 183 genes are common to the lists of the 1000 most 
> differentially expressed genes
> found using both methods.
> (see piece of code at the end)
> 
> Either 
> 1/ I am missing something, which I wouldn't be surprised by, as my expertise 
> is very limited.
> 
> In that case I am sorry for pointing out something irrelevant and thank you 
> in advance for telling
> me what I'm missing,
> 
> 2/ The differences in the normalization methods are really at the origin of 
> the observed differences.
> In that case, how can I know which method is the best for my case study? 
> Does a helpful paper exist 
> which explains in simple words the strengths/weaknesses of each method?
> 
> Thank you very much in advance for your help,
> 
> Emmanuel
> 
> -------------------------------------- CODE 
> --------------------------------------
> library(affy)
> library(limma)
> 
> # Load data into Affybatch
> data = ReadAffy(widget=T)
> 
> # Background correction / normalization
> eset.rma = rma(data)
> eset.mas = mas5(data)
> 
> # Get Expression values
> exp.rma = exprs(eset.rma)
> exp.mas = exprs(eset.mas)
> 
> # --- Look for differentially expressed genes using Limma package
> strain = c("WT","WT","WT","Drug","Drug","Drug")
> design = model.matrix(~factor(strain))
> colnames(design) = c("WT","Drug")
> 
> fit.rma = lmFit(eset.rma,design)
> fit.mas = lmFit(eset.mas,design)
> 
> fit.rma.2 = eBayes(fit.rma)
> fit.mas.2 = eBayes(fit.mas)
> 
> top.rma = as.numeric(rownames(topTable(fit.rma.2,n=1000)))
> top.mas = as.numeric(rownames(topTable(fit.mas.2,n=100)))
> length(intersect(top.rma,top.mas))
> > [1] 24
> 
> top.rma = as.numeric(rownames(topTable(fit.rma.2,n=100)))
> top.mas = as.numeric(rownames(topTable(fit.mas.2,n=1000)))
> length(intersect(top.rma,top.mas))
> > [1] 0