[BioC] "True" normalization factors in edgeR

Mon Aug 4 22:44:24 CEST 2014

Hi all,

       I have a question about the normalization factors in edgeR. I
think that if we know in advance which genes are non-DE, we could
calculate "true" normalization factors from that information. Given
"true" normalization factors, edgeR could find more of the true DE genes,
or even all of them when the simulation is simple.
       The procedure I used in edgeR is:
           library(edgeR)
           dge <- DGEList(counts = counts, group = group)  # counts, group, design: from the simulation
           norm <- calcNormFactors(dge, method = "TMM")    # TMM normalization factors
           d <- estimateGLMCommonDisp(norm, design = design)
           d <- estimateGLMTrendedDisp(d, design = design)
           d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10)
           f <- glmFit(d, design = design)
           lr <- glmLRT(f, coef = 2)                       # likelihood ratio test on coefficient 2
           pval <- lr$table$PValue
           padj <- p.adjust(pval, "BH")                    # Benjamini-Hochberg adjustment
           cbind(pval = pval, padj = padj)
        I ranked the genes by p-value or adjusted p-value to label DE genes
(for example, the 300 genes with the smallest p-values, in a scenario
with 30% DE among 1000 genes). In a simple simulation with 30% asymmetric
DE among 1000 total genes, default edgeR gets 80 of the 300 genes it
labels as DE wrong. The simulation method is in the attachment, where I
used 10 genes as a demonstration.
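        The ranking-and-counting step described above can be sketched in
base R as follows. This is only an illustration on toy p-values, not the
attached simulation: the names is_de, n_genes, n_de, top and n_wrong are
hypothetical, and the p-values are drawn at random rather than taken from
glmLRT().

```r
# Toy illustration (hypothetical names and data, not the attached simulation).
set.seed(1)
n_genes <- 1000
n_de    <- 300                              # 30% DE among 1000 genes
is_de   <- seq_len(n_genes) <= n_de         # assumed-known truth labels

# Toy p-values: DE genes tend to get smaller values than non-DE genes.
pval <- ifelse(is_de, rbeta(n_genes, 0.1, 1), runif(n_genes))

# Label the n_de genes with the smallest p-values as DE ...
top <- order(pval)[seq_len(n_de)]

# ... and count how many of those labels are wrong (truly non-DE).
n_wrong <- sum(!is_de[top])
n_wrong
```

With real results, pval would be lr$table$PValue from the code above and
is_de would come from the simulation design.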
        Then I tried two methods to calculate "true" normalization
factors. The first uses the calcNormFactors() function, but on the
counts from the non-DE genes only. The second takes the log of all the
non-DE counts and then calculates the median of
log(counts)[,i]-log(counts)[,1], where i is the index of each individual
(each column in the count matrix). Then I used norm<-norm/exp(mean(log(norm)))
to make sure that the factors multiply to one.
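        A minimal sketch of the second method, as described above, in base
R. The count matrix here is toy Poisson data and nonDE is a hypothetical
vector of assumed-known non-DE gene indices; it assumes genes in rows,
individuals in columns, and no zero counts among the non-DE genes.

```r
# Toy data (hypothetical): 10 genes x 4 individuals, no zero counts expected.
set.seed(1)
counts <- matrix(rpois(40, lambda = 50), nrow = 10, ncol = 4)
nonDE  <- 1:7                               # assumed-known non-DE genes

# Log counts of the non-DE genes only.
logc <- log(counts[nonDE, , drop = FALSE])

# For each individual i: median over non-DE genes of
# log(counts)[, i] - log(counts)[, 1].
logratio <- apply(logc - logc[, 1], 2, median)
norm <- exp(logratio)

# Rescale so that the factors multiply to one.
norm <- norm / exp(mean(log(norm)))
norm
```

Dividing by the geometric mean, exp(mean(log(norm))), is what makes the
product of the factors equal to one.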
         After that, I replaced the "norm" in the third line of the code
above (the calcNormFactors() call) with the "true" normalization factors.
However, both methods find 110 wrong DE genes, which is more than the
default edgeR method finds (80 wrong DE genes). What would be a better
procedure for finding the "true" normalization factors in edgeR, given
that we know which genes are non-DE?
         Thank you!

Best regards,
Tianyu Zhan