[BioC] "True" normalization factors in edgeR
Mark Robinson
mark.robinson at imls.uzh.ch
Sun Aug 10 21:38:33 CEST 2014
Hi Tianyu,
That's much clearer, thanks.
In calNorm(), I changed:
counts_edgeR<-counts[indNonDE,]
de <- DGEList(counts = counts_edgeR, group = group )
de <- calcNormFactors(de,method="TMM")
to:
de <- DGEList(counts = counts, group = group )
de1 <- calcNormFactors(de[indNonDE,],method="TMM")
de$samples$norm.factors <- de1$samples$norm.factors
And, the results improve greatly. I'll leave it as an exercise to you to fix the other method you came up with.
Basically, edgeR multiplies the two factors together ($lib.size and $norm.factors from the $samples element) to make the library size. You were actually changing them both (since you created a DGEList object with new library sizes), when I think what you want is to just change the relative factors.
See also the following Section in the edgeR user's guide:
(http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)
2.6.3 RNA composition
[..]
We call the product of the original library size and the scaling factors the effective library size. The effective library size replaces the
original library size in all downsteam analyses.
BTW, I haven't followed your calculations, but you may also be interested in the following manuscript, which proposes a similar thing:
http://www.biomedcentral.com/1471-2105/14/219/abstract
Hope that helps,
Mark
----------
Prof. Dr. Mark Robinson
Statistical Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea
On 10.08.2014, at 18:38, Zhan Tianyu <sewen67 at gmail.com> wrote:
> Hi Mark:
> Hello,
> Sorry for the unclear demonstration in the previous email. My question is how to find "true" normalization factors in edgeR given non-DE genes? I think that if edgeR knows non-DE genes, it would have lower false positive rates (find fewer wrong DE genes). By studying this, I want to have a better understanding of how edgeR calculates normalization factors.
>
> Here is my sessionInfo():
>
> > sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-apple-darwin13.1.0 (64-bit)
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] grid splines parallel stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] gdata_2.13.3 tm_0.6 NLP_0.1-3 gridExtra_0.9.1 tweeDEseqCountData_1.2.0
> [6] Biobase_2.24.0 BiocGenerics_0.10.0 edgeR_3.6.7 limma_3.20.8
> loaded via a namespace (and not attached):
> [1] gtools_3.4.1 slam_0.1-32 tools_3.1.0
>
> Here is a reproducible example:
>
> 1. Run the source code written by Xiaobei and you:
> Pipeline: http://130.60.190.4/robinson_lab/edgeR_robust/pipeline.pdf
> Source: http://130.60.190.4/robinson_lab/edgeR_robust/robust_simulation.R
>
> 2. Simulate NB data:
>
> library(tweeDEseqCountData)
> data(pickrell)
> pickrell<-as.matrix(exprs(pickrell.eset))
> set.seed(3)
> nSamp<-8
> grp <- as.factor(rep(0:1, each = nSamp/2)) ## set up group: 4 vs 4
> data_2 <- NBsim(foldDiff=3,dataset =pickrell, nTags = 1000, group = grp, verbose = TRUE, add.outlier = TRUE,
> outlierMech = "S", pOutlier = 0.1, drop.extreme.dispersion = 0.1,pDiff=0.3,pUp=0.9)
>
> 3. Plot the true DE genes I have put in:
>
> d <- DGEList(counts=data_2$counts, group=grp, lib.size=colSums(data_2$counts))
> rownames(d$counts) <- paste("ids",1:nrow(d$counts),sep="")
> d <- estimateCommonDisp(d)
> true.tags<-rownames(data_2$counts[data_2$indDE,])
> plotSmear(d, de.tags=true.tags)
>
> 4. Introduce two methods of calculating "true" normalization factors with non-DE genes. The first method works on log of non-DE counts. It first calculates the difference of log non-DE counts between each individual (column) and first individual (column). Then I take exponential of the median of each column to get "true" normalization factors. The second method is using calcNormFactors() function, but with non-DE genes.
>
> calNorm<-function(counts,group,indNonDE){
> counts<-counts+1 ## Avoid zero counts issue
> ## First method: Calculate true normalization factors based on median of log counts difference, and stored them in "norm_edgeR_median"
> ## variable.
> norm<-rep(NA,8)
> baseline<-counts[indNonDE,1]
> for (i in 1:length(norm))
> {
> norm_temp<-data_2$counts[indNonDE,i]
> norm[i]<-median(log(norm_temp)-log(baseline))
> }
> norm_edgeR_median<-exp(norm)
> ## Forced all factors multiple to 1
> norm_edgeR_median<-norm_edgeR_median/exp(mean(log(norm_edgeR_median)))
>
> ## Second method: Calculate true normalization factors for edgeR with none-DE genes, using calcNormFactors() function,
> ## and stored them in "norm_edgeR_calc" variable.
> counts_edgeR<-counts[indNonDE,]
> de <- DGEList(counts = counts_edgeR, group = group )
> de <- calcNormFactors(de,method="TMM")
> norm_edgeR_calc<-de$samples$norm.factors
> list(norm_edgeR_median=norm_edgeR_median,norm_edgeR_calc=norm_edgeR_calc)
> }
>
> library(gridExtra)
> norm_factors<-calNorm(data_2$counts,grp,data_2$indNonDE)
>
> edgeR_median.pfun <-
> function(counts, group, design = NULL, mc.cores = 4, prior.df=10)
> {
> ## edgeR standard pipeline ##
> library(edgeR)
> d <- DGEList(counts = counts, group = group )
> d <- calcNormFactors(d,method="TMM")
> ### Use median of log difference of none-DE genes as normalization factors
> d$samples$norm.factors<-norm_factors$norm_edgeR_median
> d <- estimateGLMCommonDisp(d, design = design)
> d <- estimateGLMTrendedDisp(d,design=design)
> d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df)
> f <- glmFit(d, design = design)
> lr <- glmLRT(f, coef=2)
> pval = lr$table$PValue
> padj = p.adjust(pval, "BH")
> cbind(pval = pval, padj = padj)
> }
>
> edgeR_calc.pfun <-
> function(counts, group, design = NULL, mc.cores = 4, prior.df=10)
> {
> ## edgeR standard pipeline ##
> library(edgeR)
> d <- DGEList(counts = counts, group = group )
> d <- calcNormFactors(d,method="TMM")
> ### Use calcNormFactors() with none-DE genes to calculate normalization factors.
> d$samples$norm.factors<-norm_factors$norm_edgeR_calc
> d <- estimateGLMCommonDisp(d, design = design)
> d <- estimateGLMTrendedDisp(d,design=design)
> d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df)
> f <- glmFit(d, design = design)
> lr <- glmLRT(f, coef=2)
> pval = lr$table$PValue
> padj = p.adjust(pval, "BH")
> cbind(pval = pval, padj = padj)
> }
>
> 5. Generate false positive plot:
>
> pvals_2 <- pval(data_2, method =c("edgeR","edgeR_median","edgeR_calc"), count.type = "counts",mc.cores = 4)
> fdPlot(pvals_2, cex = 0.6,xlim=c(0,700))
>
> I think the result is unexpected because edgeR_median and edgeR_calc have higher false positive rate than default edgeR. The reason would be that I am not using a proper way to calculate "true" normalization factors with non-DE genes? Thanks for your time!
>
> Best regards,
> Tianyu Zhan
>
>
> 2014-08-10 7:24 GMT-04:00 Mark Robinson <mark.robinson at imls.uzh.ch>:
> And, please keep the discussion on the list. Others can benefit.
>
> Also, re-read the Posting guide:
> http://www.bioconductor.org/help/mailing-list/posting-guide/
>
> M.
>
>
>
>
> On 10.08.2014, at 13:08, Mark Robinson <mark.robinson at imls.uzh.ch> wrote:
>
> > Hi Tianyu,
> >
> > Good questions get good answers.
> >
> > If you want my help, you should make it easier for me:
> >
> > 1. Answer my original questions. In particular, look at your data to see what normalization factors you've introduced. That may give you some indication of if you've done something wrong.
> >
> > 2. Instead of sending me a bunch of files, tell me exactly what I should look at.
> >
> > Mark
> >
> >
> > ----------
> > Prof. Dr. Mark Robinson
> > Statistical Bioinformatics, Institute of Molecular Life Sciences
> > University of Zurich
> > http://ow.ly/riRea
> >
> >
> >
> >
> >
> >
> >
> > On 09.08.2014, at 20:28, Zhan Tianyu <sewen67 at gmail.com> wrote:
> >
> >> Hi Professor Robinson:
> >> Hello,
> >> Thanks for your response. It is true that in reality we never know what is non-DE genes. However, I want to have a better understanding of edgeR and its performance. I thank that if edgeR knows non-DE genes in advance, it would have lower false positive rates (find more right DE genes).
> >> The attachments contains R scripts for data simulation and false positive plots. You could first run the source code of "R_robust.R", "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by you and Xiaobei, and I also added three functions of edgeR: edgdR_median.pfun, edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take advantage of known non-DE genes. The details of how I calculate "true" normalization factors are in the file "calNorm.R" file.
> >> Then I tested the default edgeR, and three functions with known non-DE genes in two simulations: normal simulation ( Nsim() function in the file "simulation_normal.R"), and NB simulation ( NBsim() in the file "R_robust"). However, the false positive plots show that default edgeR performed the best (it has the lowest false positive rates). I think the reason is that I am not using the right way to calculate true normalization factors for edgeR, assuming that I know non-DE genes. I am wondering what is a proper way to find the true normalization factors edgeR with non-DE genes? Thanks for your time!
> >>
> >> Best regards,
> >> Tianyu Zhan
> >>
> >>
> >> 2014-08-09 10:34 GMT-04:00 Mark Robinson <mark.robinson at imls.uzh.ch>:
> >> Hi Tianyu,
> >>
> >> Sorry for the slow response (was away).
> >>
> >> I have read your attachment, but your question and simulation is still a little unclear to me. First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA-seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV(). Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do? Maybe send a completely reproducible example ? Give your sessionInfo() too.
> >>
> >>> I am wondering what is a better procedure
> >>> of finding the "true" normalization factors in edgeR, given which genes are
> >>> none-DE?
> >>
> >> The other concern is that, in practice, you never know what is non-DE. So, how is this useful?
> >>
> >> Cheers, Mark
> >>
> >>
> >> ----------
> >> Prof. Dr. Mark Robinson
> >> Statistical Bioinformatics, Institute of Molecular Life Sciences
> >> University of Zurich
> >> http://ow.ly/riRea
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I have a question concerning the normalization factors in edgeR. I
> >>> think that if we know which genes are none-DE genes in advance, we could
> >>> calculate "true" normalization factors based on those information. Given
> >>> "true normalization factors", edgeR could find more right DE genes, or even
> >>> find all the right DE genes when the simulation is simple.
> >>> The procedures I used in edgeR are:
> >>> library(edgeR)
> >>> dge <- DGEList(counts = counts, group = group )
> >>> norm <- calcNormFactors(dge,method="TMM")
> >>> d <- estimateGLMCommonDisp(norm, design = design)
> >>> d <- estimateGLMTrendedDisp(d,design=design)
> >>> d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10)
> >>> f <- glmFit(d, design = design)
> >>> lr <- glmLRT(f, coef=2)
> >>> pval = lr$table$PValue
> >>> padj = p.adjust(pval, "BH")
> >>> cbind(pval = pval, padj = padj)
> >>> I used the order of p-value or adjust p-value to label DE genes
> >>> (for example, the first 300 genes in the order of p value from low to
> >>> hight, in a scenario with 30% DE in 1000 genes). In a simple simulation
> >>> with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80
> >>> wrong DE genes in 300 DE genes. The simulation method is in the attachment,
> >>> and I used 10 genes as a demonstration.
> >>> Then I tried two methods to calculate "true" normalization
> >>> factors". The first one is using calnormFactors() function, but using
> >>> counts from all none-DE genes. The second one is taking log of all the
> >>> none-DE counts, and then calculate the median of
> >>> log(counts)[,i]-log(counts)[,1], where i is the index of each individual
> >>> (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm)))
> >>> to make sure that all factors multiple to one.
> >>> After that, I replaced the "norm" in third line of above codes
> >>> with the "true normalization factors". However, both methods would find 110
> >>> wrong DE genes, which is higher than the false discovery rates of default
> >>> edgeR method (80 wrong DE genes). I am wondering what is a better procedure
> >>> of finding the "true" normalization factors in edgeR, given which genes are
> >>> none-DE?
> >>> Thank you!
> >>>
> >>> Best regards,
> >>> Tianyu Zhan
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >> <R scripts Tianyu Zhan.zip>
> >
>
>
More information about the Bioconductor
mailing list