[BioC] RNA-Seq, generate batch-free count matrix

Mon Jul 21 02:16:43 CEST 2014

Hi Brian,

> Date: Sat, 19 Jul 2014 23:01:05 -0400
> From: Brian Haas <bhaas at broadinstitute.org>
> To: bioconductor at r-project.org
> Subject: [BioC] RNA-Seq, generate batch-free count matrix
>
> Greetings all,
>
> I've been researching ways to remove batch effects from RNA-Seq count
> matrices.  Basically, I'm starting with a counts matrix that includes batch
> effects, and want to generate a new matrix of counts that has the batch
> effects removed.

In a literal sense, getting a matrix of batch corrected counts is not 
possible.  Once the batch effects have been removed, the values will no 
longer be counts.

To batch correct, it is necessary to first transform the counts to a 
pseudo-continuous scale.  Then you can use batch correction methods 
developed for microarrays.  This is how we usually do it.

First, put the counts in a DGEList object:

   library(edgeR)
   y <- DGEList(counts=counts)

Filter non-expressed genes:

   A <- aveLogCPM(y)
   y2 <- y2[A>1,]

Then normalize and compute log2 counts-per-million with an offset:

   y2 <- calcNormFactors(y2)
   logCPM <- cpm(y2, log=TRUE, prior.count=5)

Then remove batch correct:

   logCPMc <- removeBatchEffect(y, batch)

Here batch is a vector or factor taking a different value for each batch 
group.  You can input two batch vectors.

Now you can cluster the samples, for example by:

   plotMDS(logCPMc)

Variations on this would be use rpkm() instead of cpm(), or to give 
removeBatchEffect() a design matrix of known groups that are not batch 
effects.

Best wishes
Gordon

> I'm looking to apply this to sets of RNA-Seq samples (~100 samples) that
> were sequenced in batches on different days (factor) and for which I also
> have other metadata with continuous values (covariates such as total
> sequenced reads in each sample, quality metrics, etc).   I want to study
> all these samples in an unsupervised manner, and don't have a model for
> anything but the various batch effects that I want removed (ie. no cancer
> vs. normal labeling, instead they're all 'normal' and I'd like to see if
> they form clusters based on natural variation in the population, and
> perhaps identify subtypes).
>
>> From what I've read thus far, methods like sva (and the included Combat)
> require that you provide a model for the covariates that you do not want
> removed (biological factors) in addition to the ones you do want removed
> (batch effects).   Is it not possible to use these methods in my scenario,
> where I don't have factors other than the specified batch effects?
>
> In searching the bioconductor mailing list archive, I found:
>
>       edgeR package, removeBatchEffect() function
>
> which seems to do exactly what I want, and I'll experiment with it shortly.
> I'm mostly curious about what other methods might be available to do this,
> and whether the SVA or other libraries contain functions that I should
> explore.
>
> Many thanks in advance for any advice!
>
> ~brian

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}