[BioC] "True" normalization factors in edgeR
Mark Robinson
mark.robinson at imls.uzh.ch
Sat Aug 9 16:34:58 CEST 2014
Hi Tianyu,
Sorry for the slow response (was away).
I have read your attachment, but your question and simulation is still a little unclear to me. First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA-seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV(). Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do? Maybe send a completely reproducible example ? Give your sessionInfo() too.
> I am wondering what is a better procedure
> of finding the "true" normalization factors in edgeR, given which genes are
> none-DE?
The other concern is that, in practice, you never know what is non-DE. So, how is this useful?
Cheers, Mark
----------
Prof. Dr. Mark Robinson
Statistical Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea
On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at gmail.com> wrote:
> Hi all,
>
> I have a question concerning the normalization factors in edgeR. I
> think that if we know which genes are none-DE genes in advance, we could
> calculate "true" normalization factors based on those information. Given
> "true normalization factors", edgeR could find more right DE genes, or even
> find all the right DE genes when the simulation is simple.
> The procedures I used in edgeR are:
> library(edgeR)
> dge <- DGEList(counts = counts, group = group )
> norm <- calcNormFactors(dge,method="TMM")
> d <- estimateGLMCommonDisp(norm, design = design)
> d <- estimateGLMTrendedDisp(d,design=design)
> d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10)
> f <- glmFit(d, design = design)
> lr <- glmLRT(f, coef=2)
> pval = lr$table$PValue
> padj = p.adjust(pval, "BH")
> cbind(pval = pval, padj = padj)
> I used the order of p-value or adjust p-value to label DE genes
> (for example, the first 300 genes in the order of p value from low to
> hight, in a scenario with 30% DE in 1000 genes). In a simple simulation
> with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80
> wrong DE genes in 300 DE genes. The simulation method is in the attachment,
> and I used 10 genes as a demonstration.
> Then I tried two methods to calculate "true" normalization
> factors". The first one is using calnormFactors() function, but using
> counts from all none-DE genes. The second one is taking log of all the
> none-DE counts, and then calculate the median of
> log(counts)[,i]-log(counts)[,1], where i is the index of each individual
> (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm)))
> to make sure that all factors multiple to one.
> After that, I replaced the "norm" in third line of above codes
> with the "true normalization factors". However, both methods would find 110
> wrong DE genes, which is higher than the false discovery rates of default
> edgeR method (80 wrong DE genes). I am wondering what is a better procedure
> of finding the "true" normalization factors in edgeR, given which genes are
> none-DE?
> Thank you!
>
> Best regards,
> Tianyu Zhan
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list