[BioC] Remove batch effect in small RNASeq study (SVA or others?)

Gordon K Smyth smyth at wehi.EDU.AU
Tue Apr 29 02:06:55 CEST 2014


To add colours to the MDS plot:

  plotMDS(logCPM, col=as.numeric(batch))

Gordon

On Tue, 29 Apr 2014, Gordon K Smyth wrote:

> Dear Shirley,
>
> I don't think that scaling in prcomp() is appropriate here.
>
> Better would be:
>
> library(edgeR)
> dge <- DGEList(counts=count)
> dge <- calcNormFactors(dge)
> logCPM <- cpm(dge,log=TRUE,prior.count=5)
> plotMDS(logCPM)
> logCPM <- removeBatchEffect(logCPM,batch=batch)
> plotMDS(logCPM)
>
> Best wishes
> Gordon
>
>
> On Mon, 28 Apr 2014, shirley zhang wrote:
>
>> Dear Dr. Smyth,
>> 
>> Thanks for your reply. Here is the code sequence I used to remove the batch
>> effect and to make the PCA plot.
>> 
>> First,  getting raw counts for each feature per sample using htseq-count
>> (~64K features by using Ensemble.gtf)
>> Then, getting a count data.matrix by combing all samples together (64K
>> rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.
>> 
>>> count = cbind(count.s1, count.s2, ...., count.s14)
>> 
>> #remove the batch effect
>> library(edgeR)
>> batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>> 
>> logCPM <- cpm(count,log=TRUE,prior.count=5)
>> logCPM <- removeBatchEffect(logCPM, batch=batch)
>> 
>> #PCA
>> B.res = prcomp(logCPM, scale = TRUE)
>> pc.s = summary(.Bres)$importance[2,1:2]
>> pc1.var = round(pc.s[["PC1"]],2)
>> pc2.var = round(pc.s[["PC2"]],2)
>> 
>> pdf(file = "all.count.logCPM.rmBatch.pca")
>> plot(B.res$rotation[,1:2], main = maintitle,  xlab = paste("PC1 (variance:
>> ",pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance: ",pc2.var*100,
>> "%)", sep =""))
>> dev.off()
>> 
>>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] edgeR_3.4.0  limma_3.18.7
>> 
>> 
>> Many thanks.
>> Shirley
>> 
>> 
>> 
>> 
>> On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>> 
>>>
>>>  Date: Sun, 27 Apr 2014 20:32:55 -0400
>>>> From: shirley zhang <shirley0818 at gmail.com>
>>>> To: Gordon K Smyth <smyth at wehi.edu.au>
>>>> Cc: Bioconductor mailing list <bioconductor at r-project.org>
>>>> Subject: Re: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>         others?)
>>>> 
>>>> Dear Dr. Smyth,
>>>> 
>>>> Thank you very much for your quick reply. I did as you suggested by first
>>>> getting log CPM value, then call removeBatchEffect(). I found the PCA
>>>> figure looks better than before, but there is still a batch effect.
>>>> 
>>> 
>>> I don't see how there could still be a batch effect.
>>> 
>>> Please give the code sequence you used to remove the batch effect and to
>>> make the PCA plot.
>>>
>>>  I attached two PCA figures. One is based on log10(raw count) which is
>>>> before calling cpm() and removeBatchEffect(). Another one is after.
>>>> 
>>>> Could you look at them and give me more suggestions. Will a quantile
>>>> normalization across samples be a good idea since CPM() is still a
>>>> normalization only within each sample??
>>>> 
>>> 
>>> You should normalize the data before using removeBatchEffect(), and
>>> quantile is one possibility.
>>> 
>>> Gordon
>>>
>>>  Thanks again for your help,
>>>> Shirley
>>>> 
>>>> 
>>>> On Sun, Apr 27, 2014 at 6:54 PM, Gordon K Smyth <smyth at wehi.edu.au>
>>>> wrote:
>>>>
>>>>  Dear Shirley,
>>>>> 
>>>>> I would probably do it like this:
>>>>>
>>>>>   library(edgeR)
>>>>>   logCPM <- cpm(y,log=TRUE,prior.count=5)
>>>>>   logCPM <- removeBatchEffect(logCPM, batch=batch)
>>>>> 
>>>>> Best wishes
>>>>> Gordon
>>>>>
>>>>>  Date: Sat, 26 Apr 2014 10:51:23 -0400
>>>>> 
>>>>>> From: shirley zhang <shirley0818 at gmail.com>
>>>>>> To: Bioconductor Mailing List <bioconductor at stat.math.ethz.ch>
>>>>>> Subject: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>>>         others?)
>>>>>> 
>>>>>> I have a RNASeq paired-end data from two batches (8 samples from 
>>>>>> batch1,
>>>>>> and 7 samples from batch2). After alignment using TopHat2, then I got
>>>>>> count
>>>>>> using HTseq-count, and FPKM value via Cufflinks. A big batch effect can
>>>>>> be
>>>>>> viewed in PCA using both log10(raw count) and log10(FPKM) value.
>>>>>> 
>>>>>> I can NOT use the block factor in edgeR to remove batch effect since I
>>>>>> need
>>>>>> to first obtain residuals after adjusting for batch effect, then test
>>>>>> the
>>>>>> residuals for hundreds of thousands of SNPs (eQTL analysis).
>>>>>> 
>>>>>> My question is how to remove batch effect without using edgeR:
>>>>>> 
>>>>>> 1. is SVA ok for such a small sample size (N=15)?
>>>>>> 2. If SVA does not work, any other suggestions?
>>>>>> 
>>>>>> Many thanks,
>>>>>> Shirley
>>>>>> 
>>>>> 
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for the
>>> addressee.
>>> You must not disclose, forward, print or use it without the permission of
>>> the sender.
>>> ______________________________________________________________________
>>> 
>> 
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list