[BioC] Remove batch effect in small RNASeq study (SVA or others?)

Gordon K Smyth smyth at wehi.EDU.AU
Tue Apr 29 02:00:59 CEST 2014


Dear Shirley,

I don't think that scaling in prcomp() is appropriate here.

Better would be:

  library(edgeR)
  dge <- DGEList(counts=count)
  dge <- calcNormFactors(dge)
  logCPM <- cpm(dge,log=TRUE,prior.count=5)
  plotMDS(logCPM)
  logCPM <- removeBatchEffect(logCPM,batch=batch)
  plotMDS(logCPM)

Best wishes
Gordon


On Mon, 28 Apr 2014, shirley zhang wrote:

> Dear Dr. Smyth,
>
> Thanks for your reply. Here is the code sequence I used to remove the batch
> effect and to make the PCA plot.
>
> First,  getting raw counts for each feature per sample using htseq-count
> (~64K features by using Ensemble.gtf)
> Then, getting a count data.matrix by combing all samples together (64K
> rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.
>
>> count = cbind(count.s1, count.s2, ...., count.s14)
>
> #remove the batch effect
> library(edgeR)
> batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>
> logCPM <- cpm(count,log=TRUE,prior.count=5)
> logCPM <- removeBatchEffect(logCPM, batch=batch)
>
> #PCA
> B.res = prcomp(logCPM, scale = TRUE)
> pc.s = summary(.Bres)$importance[2,1:2]
> pc1.var = round(pc.s[["PC1"]],2)
> pc2.var = round(pc.s[["PC2"]],2)
>
> pdf(file = "all.count.logCPM.rmBatch.pca")
> plot(B.res$rotation[,1:2], main = maintitle,  xlab = paste("PC1 (variance:
> ",pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance: ",pc2.var*100,
> "%)", sep =""))
> dev.off()
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] edgeR_3.4.0  limma_3.18.7
>
>
> Many thanks.
> Shirley
>
>
>
>
> On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>>
>>  Date: Sun, 27 Apr 2014 20:32:55 -0400
>>> From: shirley zhang <shirley0818 at gmail.com>
>>> To: Gordon K Smyth <smyth at wehi.edu.au>
>>> Cc: Bioconductor mailing list <bioconductor at r-project.org>
>>> Subject: Re: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>         others?)
>>>
>>> Dear Dr. Smyth,
>>>
>>> Thank you very much for your quick reply. I did as you suggested by first
>>> getting log CPM value, then call removeBatchEffect(). I found the PCA
>>> figure looks better than before, but there is still a batch effect.
>>>
>>
>> I don't see how there could still be a batch effect.
>>
>> Please give the code sequence you used to remove the batch effect and to
>> make the PCA plot.
>>
>>  I attached two PCA figures. One is based on log10(raw count) which is
>>> before calling cpm() and removeBatchEffect(). Another one is after.
>>>
>>> Could you look at them and give me more suggestions. Will a quantile
>>> normalization across samples be a good idea since CPM() is still a
>>> normalization only within each sample??
>>>
>>
>> You should normalize the data before using removeBatchEffect(), and
>> quantile is one possibility.
>>
>> Gordon
>>
>>  Thanks again for your help,
>>> Shirley
>>>
>>>
>>> On Sun, Apr 27, 2014 at 6:54 PM, Gordon K Smyth <smyth at wehi.edu.au>
>>> wrote:
>>>
>>>  Dear Shirley,
>>>>
>>>> I would probably do it like this:
>>>>
>>>>   library(edgeR)
>>>>   logCPM <- cpm(y,log=TRUE,prior.count=5)
>>>>   logCPM <- removeBatchEffect(logCPM, batch=batch)
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>>  Date: Sat, 26 Apr 2014 10:51:23 -0400
>>>>
>>>>> From: shirley zhang <shirley0818 at gmail.com>
>>>>> To: Bioconductor Mailing List <bioconductor at stat.math.ethz.ch>
>>>>> Subject: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>>         others?)
>>>>>
>>>>> I have a RNASeq paired-end data from two batches (8 samples from batch1,
>>>>> and 7 samples from batch2). After alignment using TopHat2, then I got
>>>>> count
>>>>> using HTseq-count, and FPKM value via Cufflinks. A big batch effect can
>>>>> be
>>>>> viewed in PCA using both log10(raw count) and log10(FPKM) value.
>>>>>
>>>>> I can NOT use the block factor in edgeR to remove batch effect since I
>>>>> need
>>>>> to first obtain residuals after adjusting for batch effect, then test
>>>>> the
>>>>> residuals for hundreds of thousands of SNPs (eQTL analysis).
>>>>>
>>>>> My question is how to remove batch effect without using edgeR:
>>>>>
>>>>> 1. is SVA ok for such a small sample size (N=15)?
>>>>> 2. If SVA does not work, any other suggestions?
>>>>>
>>>>> Many thanks,
>>>>> Shirley
>>>>>
>>>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list