[BioC] "True" normalization factors in edgeR

Sun Aug 10 13:08:32 CEST 2014

Hi Tianyu,

Good questions get good answers.

If you want my help, you should make it easier for me:

1. Answer my original questions. In particular, look at your data to see what normalization factors you've introduced.  That may give you some indication of if you've done something wrong.

2. Instead of sending me a bunch of files, tell me exactly what I should look at.

Mark

----------
Prof. Dr. Mark Robinson
Statistical Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea

On 09.08.2014, at 20:28, Zhan Tianyu <sewen67 at gmail.com> wrote:

> Hi Professor Robinson:
>      Hello,
>      Thanks for your response. It is true that in reality we never know what is non-DE genes. However, I want to have a better understanding of edgeR and its performance. I thank that if edgeR knows non-DE genes in advance, it would have lower false positive rates (find more right DE genes). 
>      The attachments contains R scripts for data simulation and false positive plots. You could first run the source code of "R_robust.R", "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by you and Xiaobei, and I also added three functions of edgeR: edgdR_median.pfun, edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take advantage of known non-DE genes. The details of how I calculate "true" normalization factors are in the file "calNorm.R" file. 
>      Then I tested the default edgeR, and three functions with known non-DE genes in two simulations: normal simulation ( Nsim() function in the file "simulation_normal.R"), and NB simulation ( NBsim() in the file "R_robust"). However, the false positive plots show that default edgeR performed the best (it has the lowest false positive rates). I think the reason is that I am not using the right way to calculate true normalization factors for edgeR, assuming that I know non-DE genes. I am wondering what is a proper way to find the true normalization factors edgeR with non-DE genes? Thanks for your time!
> 
> Best regards,
> Tianyu Zhan 
> 
> 
> 2014-08-09 10:34 GMT-04:00 Mark Robinson <mark.robinson at imls.uzh.ch>:
> Hi Tianyu,
> 
> Sorry for the slow response (was away).
> 
> I have read your attachment, but your question and simulation is still a little unclear to me.  First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA-seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV().  Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do?  Maybe send a completely reproducible example ?  Give your sessionInfo() too.
> 
> > I am wondering what is a better procedure
> > of finding the "true" normalization factors in edgeR, given which genes are
> > none-DE?
> 
> The other concern is that, in practice, you never know what is non-DE.  So, how is this useful?
> 
> Cheers, Mark
> 
> 
> ----------
> Prof. Dr. Mark Robinson
> Statistical Bioinformatics, Institute of Molecular Life Sciences
> University of Zurich
> http://ow.ly/riRea
> 
> 
> 
> 
> 
> 
> 
> On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at gmail.com> wrote:
> 
> > Hi all,
> >
> >       I have a question concerning the normalization factors in edgeR. I
> > think that if we know which genes are none-DE genes in advance, we could
> > calculate "true" normalization factors based on those information. Given
> > "true normalization factors", edgeR could find more right DE genes, or even
> > find all the right DE genes when the simulation is simple.
> >       The procedures I used in edgeR are:
> >           library(edgeR)
> >           dge <- DGEList(counts = counts, group = group )
> >           norm <- calcNormFactors(dge,method="TMM")
> >           d <- estimateGLMCommonDisp(norm, design = design)
> >           d <- estimateGLMTrendedDisp(d,design=design)
> >           d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10)
> >           f <- glmFit(d, design = design)
> >           lr <- glmLRT(f, coef=2)
> >           pval = lr$table$PValue
> >           padj = p.adjust(pval, "BH")
> >           cbind(pval = pval, padj = padj)
> >        I used the order of p-value or adjust p-value to label DE genes
> > (for example, the first 300 genes in the order of p value from low to
> > hight, in a scenario with 30% DE in 1000 genes). In a simple simulation
> > with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80
> > wrong DE genes in 300 DE genes. The simulation method is in the attachment,
> > and I used 10 genes as a demonstration.
> >        Then I tried two methods to calculate "true" normalization
> > factors". The first one is using calnormFactors() function, but using
> > counts from all none-DE genes. The second one is taking log of all the
> > none-DE counts, and then calculate the median of
> > log(counts)[,i]-log(counts)[,1], where i is the index of each individual
> > (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm)))
> > to make sure that all factors multiple to one.
> >         After that, I replaced the "norm" in third line of above codes
> > with the "true normalization factors". However, both methods would find 110
> > wrong DE genes, which is higher than the false discovery rates of default
> > edgeR method (80 wrong DE genes). I am wondering what is a better procedure
> > of finding the "true" normalization factors in edgeR, given which genes are
> > none-DE?
> >         Thank you!
> >
> > Best regards,
> > Tianyu Zhan
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> <R scripts Tianyu Zhan.zip>