[BioC] "True" normalization factors in edgeR

Sat Aug 9 16:34:58 CEST 2014

Hi Tianyu,

Sorry for the slow response (was away).

I have read your attachment, but your question and simulation is still a little unclear to me.  First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA-seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV().  Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do?  Maybe send a completely reproducible example ?  Give your sessionInfo() too.

> I am wondering what is a better procedure
> of finding the "true" normalization factors in edgeR, given which genes are
> none-DE?

The other concern is that, in practice, you never know what is non-DE.  So, how is this useful?

Cheers, Mark

----------
Prof. Dr. Mark Robinson
Statistical Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea

On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at gmail.com> wrote:

> Hi all,
> 
>       I have a question concerning the normalization factors in edgeR. I
> think that if we know which genes are none-DE genes in advance, we could
> calculate "true" normalization factors based on those information. Given
> "true normalization factors", edgeR could find more right DE genes, or even
> find all the right DE genes when the simulation is simple.
>       The procedures I used in edgeR are:
>           library(edgeR)
>           dge <- DGEList(counts = counts, group = group )
>           norm <- calcNormFactors(dge,method="TMM")
>           d <- estimateGLMCommonDisp(norm, design = design)
>           d <- estimateGLMTrendedDisp(d,design=design)
>           d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10)
>           f <- glmFit(d, design = design)
>           lr <- glmLRT(f, coef=2)
>           pval = lr$table$PValue
>           padj = p.adjust(pval, "BH")
>           cbind(pval = pval, padj = padj)
>        I used the order of p-value or adjust p-value to label DE genes
> (for example, the first 300 genes in the order of p value from low to
> hight, in a scenario with 30% DE in 1000 genes). In a simple simulation
> with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80
> wrong DE genes in 300 DE genes. The simulation method is in the attachment,
> and I used 10 genes as a demonstration.
>        Then I tried two methods to calculate "true" normalization
> factors". The first one is using calnormFactors() function, but using
> counts from all none-DE genes. The second one is taking log of all the
> none-DE counts, and then calculate the median of
> log(counts)[,i]-log(counts)[,1], where i is the index of each individual
> (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm)))
> to make sure that all factors multiple to one.
>         After that, I replaced the "norm" in third line of above codes
> with the "true normalization factors". However, both methods would find 110
> wrong DE genes, which is higher than the false discovery rates of default
> edgeR method (80 wrong DE genes). I am wondering what is a better procedure
> of finding the "true" normalization factors in edgeR, given which genes are
> none-DE?
>         Thank you!
> 
> Best regards,
> Tianyu Zhan
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor