[BioC] "True" normalization factors in edgeR

Sun Aug 10 21:38:33 CEST 2014

Hi Tianyu,

That's much clearer, thanks.

In calNorm(), I changed:

            counts_edgeR<-counts[indNonDE,]
            de <- DGEList(counts = counts_edgeR, group = group )
            de <- calcNormFactors(de,method="TMM")

to:

            de <- DGEList(counts = counts, group = group )
            de1 <- calcNormFactors(de[indNonDE,],method="TMM")
            de$samples$norm.factors <- de1$samples$norm.factors

And, the results improve greatly.  I'll leave it as an exercise to you to fix the other method you came up with.

Basically, edgeR multiplies the two factors together ($lib.size and $norm.factors from the $samples element) to make the library size.  You were actually changing them both (since you created a DGEList object with new library sizes), when I think what you want is to just change the relative factors.

See also the following Section in the edgeR user's guide:
(http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)

2.6.3 RNA composition

[..]
We call the product of the original library size and the scaling factors the effective library size. The effective library size replaces the
original library size in all downsteam analyses.

BTW, I haven't followed your calculations, but you may also be interested in the following manuscript, which proposes a similar thing:

http://www.biomedcentral.com/1471-2105/14/219/abstract

Hope that helps,
Mark

----------
Prof. Dr. Mark Robinson
Statistical Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea

On 10.08.2014, at 18:38, Zhan Tianyu <sewen67 at gmail.com> wrote:

> Hi Mark:
>     Hello,
>     Sorry for the unclear demonstration in the previous email. My question is how to find "true" normalization factors in edgeR given non-DE genes? I think that if edgeR knows non-DE genes, it would have lower false positive rates (find fewer wrong DE genes). By studying this, I want to have a better understanding of how edgeR calculates normalization factors. 
> 
>     Here is my sessionInfo():
> 
>     > sessionInfo()
>        R version 3.1.0 (2014-04-10)
>        Platform: x86_64-apple-darwin13.1.0 (64-bit)
>        locale:
>        [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>        attached base packages:
>        [1] grid      splines   parallel  stats     graphics  grDevices utils     datasets  methods   base     
>        other attached packages:
>        [1] gdata_2.13.3             tm_0.6                   NLP_0.1-3                gridExtra_0.9.1          tweeDEseqCountData_1.2.0
>        [6] Biobase_2.24.0           BiocGenerics_0.10.0      edgeR_3.6.7              limma_3.20.8            
>        loaded via a namespace (and not attached):
>        [1] gtools_3.4.1 slam_0.1-32  tools_3.1.0 
> 
>     Here is a reproducible example:
> 
>     1. Run the source code written by Xiaobei and you:
>         Pipeline: http://130.60.190.4/robinson_lab/edgeR_robust/pipeline.pdf
>         Source: http://130.60.190.4/robinson_lab/edgeR_robust/robust_simulation.R
> 
>     2. Simulate NB data:
> 
>         library(tweeDEseqCountData)
>         data(pickrell)
>         pickrell<-as.matrix(exprs(pickrell.eset))
>         set.seed(3)
>         nSamp<-8
>         grp <- as.factor(rep(0:1, each = nSamp/2))       ## set up group: 4 vs 4
>         data_2 <- NBsim(foldDiff=3,dataset =pickrell, nTags = 1000, group = grp, verbose = TRUE, add.outlier = TRUE, 
>                                   outlierMech = "S", pOutlier = 0.1, drop.extreme.dispersion = 0.1,pDiff=0.3,pUp=0.9)
>  
>     3. Plot the true DE genes I have put in:
> 
>        d <- DGEList(counts=data_2$counts, group=grp, lib.size=colSums(data_2$counts))
>        rownames(d$counts) <- paste("ids",1:nrow(d$counts),sep="")
>        d <- estimateCommonDisp(d)
>        true.tags<-rownames(data_2$counts[data_2$indDE,])
>        plotSmear(d, de.tags=true.tags)
> 
>     4. Introduce two methods of calculating "true" normalization factors with non-DE genes. The first method works on log of non-DE counts. It first calculates the difference of log non-DE counts between each individual (column) and first individual (column). Then I take exponential of the median of each column to get "true" normalization factors. The second method is using calcNormFactors() function, but with non-DE genes. 
> 
>         calNorm<-function(counts,group,indNonDE){
>             counts<-counts+1            ## Avoid zero counts issue
>             ## First method: Calculate true normalization factors based on median of log counts difference, and stored them in "norm_edgeR_median"
>             ## variable. 
>             norm<-rep(NA,8)
>             baseline<-counts[indNonDE,1]
>             for (i in 1:length(norm))
>             {
>                 norm_temp<-data_2$counts[indNonDE,i]
>                 norm[i]<-median(log(norm_temp)-log(baseline))
>             }
>             norm_edgeR_median<-exp(norm)
>             ## Forced all factors multiple to 1
>             norm_edgeR_median<-norm_edgeR_median/exp(mean(log(norm_edgeR_median)))
> 
>             ## Second method: Calculate true normalization factors for edgeR with none-DE genes, using calcNormFactors() function,
>             ## and stored them in "norm_edgeR_calc" variable. 
>             counts_edgeR<-counts[indNonDE,]
>             de <- DGEList(counts = counts_edgeR, group = group )
>             de <- calcNormFactors(de,method="TMM")
>             norm_edgeR_calc<-de$samples$norm.factors
>             list(norm_edgeR_median=norm_edgeR_median,norm_edgeR_calc=norm_edgeR_calc)
>          }
> 
>         library(gridExtra)
>         norm_factors<-calNorm(data_2$counts,grp,data_2$indNonDE)
>  
>        edgeR_median.pfun <-
>            function(counts, group, design = NULL, mc.cores = 4, prior.df=10)
>            {
>                 ## edgeR standard pipeline ##
>                 library(edgeR)
>                 d <- DGEList(counts = counts, group = group )
>                 d <- calcNormFactors(d,method="TMM")
>                 ### Use median of log difference of none-DE genes as normalization factors 
>                 d$samples$norm.factors<-norm_factors$norm_edgeR_median
>                 d <- estimateGLMCommonDisp(d, design = design)
>                 d <- estimateGLMTrendedDisp(d,design=design)
>                 d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df)
>                 f <- glmFit(d, design = design)
>                 lr <- glmLRT(f, coef=2)
>                 pval = lr$table$PValue
>                 padj = p.adjust(pval, "BH")
>                 cbind(pval = pval, padj = padj)
>              }
> 
>         edgeR_calc.pfun <-
>            function(counts, group, design = NULL, mc.cores = 4, prior.df=10)
>            {
>                 ## edgeR standard pipeline ##
>                 library(edgeR)
>                 d <- DGEList(counts = counts, group = group )
>                 d <- calcNormFactors(d,method="TMM")
>                 ### Use calcNormFactors() with none-DE genes to calculate normalization factors. 
>                 d$samples$norm.factors<-norm_factors$norm_edgeR_calc
>                 d <- estimateGLMCommonDisp(d, design = design)
>                 d <- estimateGLMTrendedDisp(d,design=design)
>                 d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df)
>                 f <- glmFit(d, design = design)
>                 lr <- glmLRT(f, coef=2)
>                 pval = lr$table$PValue
>                 padj = p.adjust(pval, "BH")
>                 cbind(pval = pval, padj = padj)
>             }
> 
>        5. Generate false positive plot:
> 
>            pvals_2 <- pval(data_2, method =c("edgeR","edgeR_median","edgeR_calc"), count.type = "counts",mc.cores = 4)
>            fdPlot(pvals_2, cex = 0.6,xlim=c(0,700))
> 
>        I think the result is unexpected because edgeR_median and edgeR_calc have higher false positive rate than default edgeR. The reason would be that I am not using a proper way to calculate "true" normalization factors with non-DE genes? Thanks for your time!
> 
> Best regards,
> Tianyu Zhan 
> 
> 
> 2014-08-10 7:24 GMT-04:00 Mark Robinson <mark.robinson at imls.uzh.ch>:
> And, please keep the discussion on the list.  Others can benefit.
> 
> Also, re-read the Posting guide:
> http://www.bioconductor.org/help/mailing-list/posting-guide/
> 
> M.
> 
> 
> 
> 
> On 10.08.2014, at 13:08, Mark Robinson <mark.robinson at imls.uzh.ch> wrote:
> 
> > Hi Tianyu,
> >
> > Good questions get good answers.
> >
> > If you want my help, you should make it easier for me:
> >
> > 1. Answer my original questions. In particular, look at your data to see what normalization factors you've introduced.  That may give you some indication of if you've done something wrong.
> >
> > 2. Instead of sending me a bunch of files, tell me exactly what I should look at.
> >
> > Mark
> >
> >
> > ----------
> > Prof. Dr. Mark Robinson
> > Statistical Bioinformatics, Institute of Molecular Life Sciences
> > University of Zurich
> > http://ow.ly/riRea
> >
> >
> >
> >
> >
> >
> >
> > On 09.08.2014, at 20:28, Zhan Tianyu <sewen67 at gmail.com> wrote:
> >
> >> Hi Professor Robinson:
> >>     Hello,
> >>     Thanks for your response. It is true that in reality we never know what is non-DE genes. However, I want to have a better understanding of edgeR and its performance. I thank that if edgeR knows non-DE genes in advance, it would have lower false positive rates (find more right DE genes).
> >>     The attachments contains R scripts for data simulation and false positive plots. You could first run the source code of "R_robust.R", "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by you and Xiaobei, and I also added three functions of edgeR: edgdR_median.pfun, edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take advantage of known non-DE genes. The details of how I calculate "true" normalization factors are in the file "calNorm.R" file.
> >>     Then I tested the default edgeR, and three functions with known non-DE genes in two simulations: normal simulation ( Nsim() function in the file "simulation_normal.R"), and NB simulation ( NBsim() in the file "R_robust"). However, the false positive plots show that default edgeR performed the best (it has the lowest false positive rates). I think the reason is that I am not using the right way to calculate true normalization factors for edgeR, assuming that I know non-DE genes. I am wondering what is a proper way to find the true normalization factors edgeR with non-DE genes? Thanks for your time!
> >>
> >> Best regards,
> >> Tianyu Zhan
> >>
> >>
> >> 2014-08-09 10:34 GMT-04:00 Mark Robinson <mark.robinson at imls.uzh.ch>:
> >> Hi Tianyu,
> >>
> >> Sorry for the slow response (was away).
> >>
> >> I have read your attachment, but your question and simulation is still a little unclear to me.  First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA-seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV().  Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do?  Maybe send a completely reproducible example ?  Give your sessionInfo() too.
> >>
> >>> I am wondering what is a better procedure
> >>> of finding the "true" normalization factors in edgeR, given which genes are
> >>> none-DE?
> >>
> >> The other concern is that, in practice, you never know what is non-DE.  So, how is this useful?
> >>
> >> Cheers, Mark
> >>
> >>
> >> ----------
> >> Prof. Dr. Mark Robinson
> >> Statistical Bioinformatics, Institute of Molecular Life Sciences
> >> University of Zurich
> >> http://ow.ly/riRea
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>>      I have a question concerning the normalization factors in edgeR. I
> >>> think that if we know which genes are none-DE genes in advance, we could
> >>> calculate "true" normalization factors based on those information. Given
> >>> "true normalization factors", edgeR could find more right DE genes, or even
> >>> find all the right DE genes when the simulation is simple.
> >>>      The procedures I used in edgeR are:
> >>>          library(edgeR)
> >>>          dge <- DGEList(counts = counts, group = group )
> >>>          norm <- calcNormFactors(dge,method="TMM")
> >>>          d <- estimateGLMCommonDisp(norm, design = design)
> >>>          d <- estimateGLMTrendedDisp(d,design=design)
> >>>          d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10)
> >>>          f <- glmFit(d, design = design)
> >>>          lr <- glmLRT(f, coef=2)
> >>>          pval = lr$table$PValue
> >>>          padj = p.adjust(pval, "BH")
> >>>          cbind(pval = pval, padj = padj)
> >>>       I used the order of p-value or adjust p-value to label DE genes
> >>> (for example, the first 300 genes in the order of p value from low to
> >>> hight, in a scenario with 30% DE in 1000 genes). In a simple simulation
> >>> with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80
> >>> wrong DE genes in 300 DE genes. The simulation method is in the attachment,
> >>> and I used 10 genes as a demonstration.
> >>>       Then I tried two methods to calculate "true" normalization
> >>> factors". The first one is using calnormFactors() function, but using
> >>> counts from all none-DE genes. The second one is taking log of all the
> >>> none-DE counts, and then calculate the median of
> >>> log(counts)[,i]-log(counts)[,1], where i is the index of each individual
> >>> (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm)))
> >>> to make sure that all factors multiple to one.
> >>>        After that, I replaced the "norm" in third line of above codes
> >>> with the "true normalization factors". However, both methods would find 110
> >>> wrong DE genes, which is higher than the false discovery rates of default
> >>> edgeR method (80 wrong DE genes). I am wondering what is a better procedure
> >>> of finding the "true" normalization factors in edgeR, given which genes are
> >>> none-DE?
> >>>        Thank you!
> >>>
> >>> Best regards,
> >>> Tianyu Zhan
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >> <R scripts Tianyu Zhan.zip>
> >
> 
>