[BioC] Remove batch effect in small RNASeq study (SVA or others?)
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Apr 29 02:06:55 CEST 2014
To add colours to the MDS plot:
plotMDS(logCPM, col=as.numeric(batch))
Gordon
On Tue, 29 Apr 2014, Gordon K Smyth wrote:
> Dear Shirley,
>
> I don't think that scaling in prcomp() is appropriate here.
>
> Better would be:
>
> library(edgeR)
> dge <- DGEList(counts=count)
> dge <- calcNormFactors(dge)
> logCPM <- cpm(dge,log=TRUE,prior.count=5)
> plotMDS(logCPM)
> logCPM <- removeBatchEffect(logCPM,batch=batch)
> plotMDS(logCPM)
>
> Best wishes
> Gordon
>
>
> On Mon, 28 Apr 2014, shirley zhang wrote:
>
>> Dear Dr. Smyth,
>>
>> Thanks for your reply. Here is the code sequence I used to remove the batch
>> effect and to make the PCA plot.
>>
>> First, getting raw counts for each feature per sample using htseq-count
>> (~64K features by using Ensemble.gtf)
>> Then, getting a count data.matrix by combing all samples together (64K
>> rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.
>>
>>> count = cbind(count.s1, count.s2, ...., count.s14)
>>
>> #remove the batch effect
>> library(edgeR)
>> batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>>
>> logCPM <- cpm(count,log=TRUE,prior.count=5)
>> logCPM <- removeBatchEffect(logCPM, batch=batch)
>>
>> #PCA
>> B.res = prcomp(logCPM, scale = TRUE)
>> pc.s = summary(.Bres)$importance[2,1:2]
>> pc1.var = round(pc.s[["PC1"]],2)
>> pc2.var = round(pc.s[["PC2"]],2)
>>
>> pdf(file = "all.count.logCPM.rmBatch.pca")
>> plot(B.res$rotation[,1:2], main = maintitle, xlab = paste("PC1 (variance:
>> ",pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance: ",pc2.var*100,
>> "%)", sep =""))
>> dev.off()
>>
>>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] edgeR_3.4.0 limma_3.18.7
>>
>>
>> Many thanks.
>> Shirley
>>
>>
>>
>>
>> On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>
>>>
>>> Date: Sun, 27 Apr 2014 20:32:55 -0400
>>>> From: shirley zhang <shirley0818 at gmail.com>
>>>> To: Gordon K Smyth <smyth at wehi.edu.au>
>>>> Cc: Bioconductor mailing list <bioconductor at r-project.org>
>>>> Subject: Re: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>> others?)
>>>>
>>>> Dear Dr. Smyth,
>>>>
>>>> Thank you very much for your quick reply. I did as you suggested by first
>>>> getting log CPM value, then call removeBatchEffect(). I found the PCA
>>>> figure looks better than before, but there is still a batch effect.
>>>>
>>>
>>> I don't see how there could still be a batch effect.
>>>
>>> Please give the code sequence you used to remove the batch effect and to
>>> make the PCA plot.
>>>
>>> I attached two PCA figures. One is based on log10(raw count) which is
>>>> before calling cpm() and removeBatchEffect(). Another one is after.
>>>>
>>>> Could you look at them and give me more suggestions. Will a quantile
>>>> normalization across samples be a good idea since CPM() is still a
>>>> normalization only within each sample??
>>>>
>>>
>>> You should normalize the data before using removeBatchEffect(), and
>>> quantile is one possibility.
>>>
>>> Gordon
>>>
>>> Thanks again for your help,
>>>> Shirley
>>>>
>>>>
>>>> On Sun, Apr 27, 2014 at 6:54 PM, Gordon K Smyth <smyth at wehi.edu.au>
>>>> wrote:
>>>>
>>>> Dear Shirley,
>>>>>
>>>>> I would probably do it like this:
>>>>>
>>>>> library(edgeR)
>>>>> logCPM <- cpm(y,log=TRUE,prior.count=5)
>>>>> logCPM <- removeBatchEffect(logCPM, batch=batch)
>>>>>
>>>>> Best wishes
>>>>> Gordon
>>>>>
>>>>> Date: Sat, 26 Apr 2014 10:51:23 -0400
>>>>>
>>>>>> From: shirley zhang <shirley0818 at gmail.com>
>>>>>> To: Bioconductor Mailing List <bioconductor at stat.math.ethz.ch>
>>>>>> Subject: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>>> others?)
>>>>>>
>>>>>> I have a RNASeq paired-end data from two batches (8 samples from
>>>>>> batch1,
>>>>>> and 7 samples from batch2). After alignment using TopHat2, then I got
>>>>>> count
>>>>>> using HTseq-count, and FPKM value via Cufflinks. A big batch effect can
>>>>>> be
>>>>>> viewed in PCA using both log10(raw count) and log10(FPKM) value.
>>>>>>
>>>>>> I can NOT use the block factor in edgeR to remove batch effect since I
>>>>>> need
>>>>>> to first obtain residuals after adjusting for batch effect, then test
>>>>>> the
>>>>>> residuals for hundreds of thousands of SNPs (eQTL analysis).
>>>>>>
>>>>>> My question is how to remove batch effect without using edgeR:
>>>>>>
>>>>>> 1. is SVA ok for such a small sample size (N=15)?
>>>>>> 2. If SVA does not work, any other suggestions?
>>>>>>
>>>>>> Many thanks,
>>>>>> Shirley
>>>>>>
>>>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for the
>>> addressee.
>>> You must not disclose, forward, print or use it without the permission of
>>> the sender.
>>> ______________________________________________________________________
>>>
>>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list