[BioC] Remove batch effect in small RNASeq study (SVA or others?)

Tue Apr 29 08:18:54 CEST 2014

Dear Shirley,

I have only used fold changes based on prior counts for descriptive 
purposes. My understanding is that edgeR uses the counts as is to do 
the statistics (i.e. p-values), but adds in the prior count when 
computing the logFC/logCPM values to avoid taking the log of zero and 
to avoid excessively large fold changes in low-abundance genes. 
However, I do not have a sense of what values are appropriate and I 
usually stick with the default.

-Ryan

On Mon Apr 28 19:16:41 2014, shirley zhang wrote:
> Dear Ryan,
>
> I think you are right. As shown by ?cpm
>
>  ## S3 method for class 'DGEList'
>      cpm(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...)
>
> BTW, do you have experience about how to set "prior.count", e.g. 0.25
> vs. 5?
>
> Many thanks,
> Shirley
>
>
>
> On Mon, Apr 28, 2014 at 10:02 PM, Ryan <rct at thompsonclan.org
> <mailto:rct at thompsonclan.org>> wrote:
>
>     Hi Shirley,
>
>     I beleive that "normalized.lib.sizes = TRUE" has become the
>     default when calling the cpm function on a DGEList, so you should
>     not need to specify it.
>
>     -Ryan
>
>
>     On Mon Apr 28 18:59:02 2014, shirley zhang wrote:
>
>           Dear Dr. Smyth,
>
>         Thank you very much for taking your time to look through my
>         codes, and also
>         provided an more approciate code sequence. Thank you!
>
>         By following your codes, I think the batch effect was removed
>         efficiently
>         as shown in the attached figures. Do you agree?
>
>         Furthermore, I found a very useful paper you published in
>         Nature Protocol
>         2013.
>         http://www.nature.com/nprot/__journal/v8/n9/full/nprot.2013.__099.html
>         <http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html>
>
>         In the paper, you wrote:
>         d = DGEList(counts = count, group = samples$condition)
>         d = calcNormFactors(d)
>         nc = cpm(d, normalized.lib.sizes = TRUE)
>
>         My question is:
>         In my case, should I add option "normalized.lib.sizes = TRUE"
>         in cpm()
>         after calling calcNormFactors() like the following:
>         dge <- DGEList(counts=count)
>         dge <- calcNormFactors(dge)
>         logCPM <- cpm(dge,log=TRUE,prior.count=__5,
>         normalized.lib.sizes = TRUE)
>
>         Many thanks for your help,
>         Shirley
>
>
>
>         On Mon, Apr 28, 2014 at 8:06 PM, Gordon K Smyth
>         <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>> wrote:
>
>             To add colours to the MDS plot:
>
>               plotMDS(logCPM, col=as.numeric(batch))
>
>             Gordon
>
>
>             On Tue, 29 Apr 2014, Gordon K Smyth wrote:
>
>               Dear Shirley,
>
>
>                 I don't think that scaling in prcomp() is appropriate
>                 here.
>
>                 Better would be:
>
>                 library(edgeR)
>                 dge <- DGEList(counts=count)
>                 dge <- calcNormFactors(dge)
>                 logCPM <- cpm(dge,log=TRUE,prior.count=__5)
>                 plotMDS(logCPM)
>                 logCPM <- removeBatchEffect(logCPM,__batch=batch)
>                 plotMDS(logCPM)
>
>                 Best wishes
>                 Gordon
>
>
>                 On Mon, 28 Apr 2014, shirley zhang wrote:
>
>                   Dear Dr. Smyth,
>
>
>                     Thanks for your reply. Here is the code sequence I
>                     used to remove the
>                     batch
>                     effect and to make the PCA plot.
>
>                     First,  getting raw counts for each feature per
>                     sample using htseq-count
>                     (~64K features by using Ensemble.gtf)
>                     Then, getting a count data.matrix by combing all
>                     samples together (64K
>                     rows, and 14 columns). 8 samples from batch1, and
>                     6 samples from batch2.
>
>                       count = cbind(count.s1, count.s2, ...., count.s14)
>
>
>
>                     #remove the batch effect
>                     library(edgeR)
>                     batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>
>                     logCPM <- cpm(count,log=TRUE,prior.__count=5)
>                     logCPM <- removeBatchEffect(logCPM, batch=batch)
>
>                     #PCA
>                     B.res = prcomp(logCPM, scale = TRUE)
>                     pc.s = summary(.Bres)$importance[2,1:__2]
>                     pc1.var = round(pc.s[["PC1"]],2)
>                     pc2.var = round(pc.s[["PC2"]],2)
>
>                     pdf(file = "all.count.logCPM.rmBatch.pca"__)
>                     plot(B.res$rotation[,1:2], main = maintitle,  xlab
>                     = paste("PC1
>                     (variance:
>                     ",pc1.var*100, "%)", sep =""), ylab = paste("PC2
>                     (variance:
>                     ",pc2.var*100,
>                     "%)", sep =""))
>                     dev.off()
>
>                       sessionInfo()
>
>
>                     R version 3.0.2 (2013-09-25)
>                     Platform: x86_64-unknown-linux-gnu (64-bit)
>
>                     locale:
>                     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>                     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>                     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>                     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>                     [9] LC_ADDRESS=C               LC_TELEPHONE=C
>                     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>                     attached base packages:
>                     [1] stats     graphics  grDevices utils
>                     datasets  methods   base
>
>                     other attached packages:
>                     [1] edgeR_3.4.0  limma_3.18.7
>
>
>                     Many thanks.
>                     Shirley
>
>
>
>
>                     On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth
>                     <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>>
>                     wrote:
>
>
>                           Date: Sun, 27 Apr 2014 20:32:55 -0400
>
>                             From: shirley zhang <shirley0818 at gmail.com
>                             <mailto:shirley0818 at gmail.com>>
>                             To: Gordon K Smyth <smyth at wehi.edu.au
>                             <mailto:smyth at wehi.edu.au>>
>                             Cc: Bioconductor mailing list
>                             <bioconductor at r-project.org
>                             <mailto:bioconductor at r-project.org>>
>                             Subject: Re: [BioC] Remove batch effect in
>                             small RNASeq study (SVA or
>                                      others?)
>
>                             Dear Dr. Smyth,
>
>                             Thank you very much for your quick reply.
>                             I did as you suggested by
>                             first
>                             getting log CPM value, then call
>                             removeBatchEffect(). I found the PCA
>                             figure looks better than before, but there
>                             is still a batch effect.
>
>
>                         I don't see how there could still be a batch
>                         effect.
>
>                         Please give the code sequence you used to
>                         remove the batch effect and to
>                         make the PCA plot.
>
>                           I attached two PCA figures. One is based on
>                         log10(raw count) which is
>
>                             before calling cpm() and
>                             removeBatchEffect(). Another one is after.
>
>                             Could you look at them and give me more
>                             suggestions. Will a quantile
>                             normalization across samples be a good
>                             idea since CPM() is still a
>                             normalization only within each sample??
>
>
>                         You should normalize the data before using
>                         removeBatchEffect(), and
>                         quantile is one possibility.
>
>                         Gordon
>
>                           Thanks again for your help,
>
>                             Shirley
>
>
>                             On Sun, Apr 27, 2014 at 6:54 PM, Gordon K
>                             Smyth <smyth at wehi.edu.au
>                             <mailto:smyth at wehi.edu.au>>
>                             wrote:
>
>                               Dear Shirley,
>
>
>                                 I would probably do it like this:
>
>                                    library(edgeR)
>                                    logCPM <- cpm(y,log=TRUE,prior.count=5)
>                                    logCPM <- removeBatchEffect(logCPM,
>                                 batch=batch)
>
>                                 Best wishes
>                                 Gordon
>
>                                   Date: Sat, 26 Apr 2014 10:51:23 -0400
>
>                                   From: shirley zhang
>                                 <shirley0818 at gmail.com
>                                 <mailto:shirley0818 at gmail.com>>
>
>                                     To: Bioconductor Mailing List
>                                     <bioconductor at stat.math.ethz.__ch
>                                     <mailto:bioconductor at stat.math.ethz.ch>>
>                                     Subject: [BioC] Remove batch
>                                     effect in small RNASeq study (SVA or
>                                              others?)
>
>                                     I have a RNASeq paired-end data
>                                     from two batches (8 samples from
>                                     batch1,
>                                     and 7 samples from batch2). After
>                                     alignment using TopHat2, then I got
>                                     count
>                                     using HTseq-count, and FPKM value
>                                     via Cufflinks. A big batch effect
>                                     can
>                                     be
>                                     viewed in PCA using both log10(raw
>                                     count) and log10(FPKM) value.
>
>                                     I can NOT use the block factor in
>                                     edgeR to remove batch effect since
>                                     I
>                                     need
>                                     to first obtain residuals after
>                                     adjusting for batch effect, then test
>                                     the
>                                     residuals for hundreds of
>                                     thousands of SNPs (eQTL analysis).
>
>                                     My question is how to remove batch
>                                     effect without using edgeR:
>
>                                     1. is SVA ok for such a small
>                                     sample size (N=15)?
>                                     2. If SVA does not work, any other
>                                     suggestions?
>
>                                     Many thanks,
>                                     Shirley
>
>
>
>                                 ______________________________________________________________
>
>                         __________
>                         The information in this email is confidential
>                         and intended solely for
>                         the
>                         addressee.
>                         You must not disclose, forward, print or use
>                         it without the permission
>                         of
>                         the sender.
>                         __________________________________________________________________________
>
>
>
>
>             __________________________________________________________________________
>             The information in this email is confidential and intended
>             solely for the
>             addressee.
>             You must not disclose, forward, print or use it without
>             the permission of
>             the sender.
>             __________________________________________________________________________
>
>
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>