[BioC] normalization and batch correction across multiple project

Gordon K Smyth smyth at wehi.EDU.AU
Tue Sep 9 04:15:49 CEST 2014


Dear Adai,

The removeBatchEffects() function is written in such a way that it always 
runs without an error.  It makes fewer assumptions than Combat, and just 
does the best it can with whatever data you give it.

Nevertheless, I would question the value of having a batch that contains 
only one sample.  Better to pool this and another batch.

Best wishes
Gordon


On Mon, 8 Sep 2014, Adaikalavan Ramasamy wrote:

> Dear Prof. Smyth,
>
> Apologies for the late reply. Thank you so much for valuable input and yes
> reproducibility of results is paramount. As you indicated, we do sometimes
> have samples originating from different origins (e.g. PBMC vs. PAXgene) or
> different countries where sample collection/storage varies which would
> complicate matters.
>
> I tried removeBatchEffect() and it works well. The example project I tried
> had four batches, one of which consisted of only single sample. How is
> removeBatchEffects() able to deal with single sample batches when ComBat
> does not? Here is an example:
>
> expr   <- matrix( 10 + rnorm(10000*100), nc=100 )
> batch  <- paste0("Batch", rep( c(1, 2, 3, 4), c(33, 33, 33, 1) ) )
> tissue <- sample(c("A", "B", "C"), 100, replace=T)
>
> table(batch, tissue)
> #         tissue
> # batch     A  B  C
> #   Batch1 11  9 13
> #   Batch2 12 11 10
> #   Batch3 14  9 10
> #   Batch4  0  1  0
>
> mod   <- model.matrix( ~ tissue)  ## I want to preserve variation due to
> tissue
>
>
> resid <- ComBat(dat=expr, batch=batch, mod=mod)
> # Found 4 batches
> # Found 2  categorical covariate(s)
> # Standardizing Data across genes
> # Fitting L/S model and finding priors
> # Error in apply(s.data[, i], 1, var, na.rm = T) :
> #   dim(X) must have a positive length
>
>
> out <- removeBatchEffect( expr, batch=batch, design=mod ) ## works!
>
> Should I post this as a separate email? Thank you.
>
> Regards, Adai
>
>
>
> On Thu, Aug 28, 2014 at 2:01 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> We have had to regularly address the same issues that you are facing.
>> There is no blanket answer -- every case needs to be considered on its own
>> merits -- but you seem to be considering the right options.
>>
>> In our work, we generally adjust for the batch in the limma linear model
>> rather than trying to remove it up-front using combat.  Also consider
>> removeBatchEffect().
>>
>> As you say, analysing multiple projects together can help estimate a batch
>> effect.  However this approach will come unstuck if the samples for the
>> projects are very different.  There is another reason why we generally
>> avoid analysing multiple projects together.  The projects will usually need
>> to be submitted eventually to a public repository such as GEO, and the
>> different projects generally have to be submitted independently. Users will
>> not be able to reproduce our normalization and analysis unless the projects
>> are analyzed separately.
>>
>> Best wishes
>> Gordon
>>
>>  On Mon, Aug 18, 2014 at 1:11 PM, Adaikalavan Ramasamy wrote:
>>>
>>> Dear all,
>>>
>>> I would like to appeal to the collective wisdom in this group on how best
>>> to solve this problem of normalization and batch correction.
>>>
>>> We are a service unit for an academic institute and we run several
>>> projects simultaneously. We use Illumina HT12-v4 microarrays which can
>>> take up to 12 different samples per chip. As we QC the data from one
>>> project, the RNA from failed samples can be repeated to include into chips
>>> from another project (rather than running partial chips to avoid wastage).
>>> Sometimes we include samples from other projects also. Here is a simple
>>> illustration
>>>
>>> Chip No       ScanDate    Contents
>>> 1                1st July        *12 samples from project A*
>>> 2                1st July          *8 samples from project A* + 4 from
>>> project B
>>> 3                1st August   12 samples from Project B
>>> 4                1st August     *1 sample from Project A* + 5 samples
>>> from B + 6 from project C
>>> ...
>>>
>>> What is the best way to prepare the final data for *project A*? One
>>> option is to do the following:
>>>
>>>    1. Pool chips 1, 2 and 4 together.
>>>    2. Remove failed samples
>>>    3. Remove samples from other projects.
>>>    4. Normalize using NEQC from limma
>>>    5. Correct for scan date using COMBAT from sva.
>>>
>>> The other option we considered is to omit step 3 (i.e. use other samples
>>> for normalization and COMBAT) and subset at the end.
>>>
>>> I feel this second option allows for better estimation of batch effects
>>> (especially in chip 4). However, sometimes project A and B can be quite
>>> different (e.g. samples derived from different tissues) which might mess
>>> up
>>> the normalization especially if we want to compare project A to B
>>> directly. We
>>> also considered nec() followed by normalizeBetweenArrays with "Tquantile"
>>> but I felt it was too complicated. Anything else to try?
>>>
>>> Thank you.
>>>
>>> --
>>>
>>> Adaikalavan Ramasamy
>>>
>>> Senior Leadership Fellow in Bioinformatics
>>>
>>> Head of the Transcriptomics Core Facility
>>>
>>>
>>>
>>> Email: adaikalavan.ramasamy at ndm.ox.ac.uk
>>>
>>> Office: 01865 287 710
>>>
>>> Mob: 07906 308 465
>>>
>>> http://www.jenner.ac.uk/transcriptomics-facility
>>>
>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list