[BioC] normalization and batch correction across multiple project

Tue Sep 9 09:22:49 CEST 2014

Thank you for the explanation.

To clarify, this batch contains 1 sample from this project and 11 samples
from an unrelated project. So I extracted this one sample and added to
others from this project.
 On 9 Sep 2014 03:16, "Gordon K Smyth" <smyth at wehi.edu.au> wrote:

> Dear Adai,
>
> The removeBatchEffects() function is written in such a way that it always
> runs without an error.  It makes fewer assumptions than Combat, and just
> does the best it can with whatever data you give it.
>
> Nevertheless, I would question the value of having a batch that contains
> only one sample.  Better to pool this and another batch.
>
> Best wishes
> Gordon
>
>
> On Mon, 8 Sep 2014, Adaikalavan Ramasamy wrote:
>
>  Dear Prof. Smyth,
>>
>> Apologies for the late reply. Thank you so much for valuable input and yes
>> reproducibility of results is paramount. As you indicated, we do sometimes
>> have samples originating from different origins (e.g. PBMC vs. PAXgene) or
>> different countries where sample collection/storage varies which would
>> complicate matters.
>>
>> I tried removeBatchEffect() and it works well. The example project I tried
>> had four batches, one of which consisted of only single sample. How is
>> removeBatchEffects() able to deal with single sample batches when ComBat
>> does not? Here is an example:
>>
>> expr   <- matrix( 10 + rnorm(10000*100), nc=100 )
>> batch  <- paste0("Batch", rep( c(1, 2, 3, 4), c(33, 33, 33, 1) ) )
>> tissue <- sample(c("A", "B", "C"), 100, replace=T)
>>
>> table(batch, tissue)
>> #         tissue
>> # batch     A  B  C
>> #   Batch1 11  9 13
>> #   Batch2 12 11 10
>> #   Batch3 14  9 10
>> #   Batch4  0  1  0
>>
>> mod   <- model.matrix( ~ tissue)  ## I want to preserve variation due to
>> tissue
>>
>>
>> resid <- ComBat(dat=expr, batch=batch, mod=mod)
>> # Found 4 batches
>> # Found 2  categorical covariate(s)
>> # Standardizing Data across genes
>> # Fitting L/S model and finding priors
>> # Error in apply(s.data[, i], 1, var, na.rm = T) :
>> #   dim(X) must have a positive length
>>
>>
>> out <- removeBatchEffect( expr, batch=batch, design=mod ) ## works!
>>
>> Should I post this as a separate email? Thank you.
>>
>> Regards, Adai
>>
>>
>>
>> On Thu, Aug 28, 2014 at 2:01 AM, Gordon K Smyth <smyth at wehi.edu.au>
>> wrote:
>>
>>  We have had to regularly address the same issues that you are facing.
>>> There is no blanket answer -- every case needs to be considered on its
>>> own
>>> merits -- but you seem to be considering the right options.
>>>
>>> In our work, we generally adjust for the batch in the limma linear model
>>> rather than trying to remove it up-front using combat.  Also consider
>>> removeBatchEffect().
>>>
>>> As you say, analysing multiple projects together can help estimate a
>>> batch
>>> effect.  However this approach will come unstuck if the samples for the
>>> projects are very different.  There is another reason why we generally
>>> avoid analysing multiple projects together.  The projects will usually
>>> need
>>> to be submitted eventually to a public repository such as GEO, and the
>>> different projects generally have to be submitted independently. Users
>>> will
>>> not be able to reproduce our normalization and analysis unless the
>>> projects
>>> are analyzed separately.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>  On Mon, Aug 18, 2014 at 1:11 PM, Adaikalavan Ramasamy wrote:
>>>
>>>>
>>>> Dear all,
>>>>
>>>> I would like to appeal to the collective wisdom in this group on how
>>>> best
>>>> to solve this problem of normalization and batch correction.
>>>>
>>>> We are a service unit for an academic institute and we run several
>>>> projects simultaneously. We use Illumina HT12-v4 microarrays which can
>>>> take up to 12 different samples per chip. As we QC the data from one
>>>> project, the RNA from failed samples can be repeated to include into
>>>> chips
>>>> from another project (rather than running partial chips to avoid
>>>> wastage).
>>>> Sometimes we include samples from other projects also. Here is a simple
>>>> illustration
>>>>
>>>> Chip No       ScanDate    Contents
>>>> 1                1st July        *12 samples from project A*
>>>> 2                1st July          *8 samples from project A* + 4 from
>>>> project B
>>>> 3                1st August   12 samples from Project B
>>>> 4                1st August     *1 sample from Project A* + 5 samples
>>>> from B + 6 from project C
>>>> ...
>>>>
>>>> What is the best way to prepare the final data for *project A*? One
>>>> option is to do the following:
>>>>
>>>>    1. Pool chips 1, 2 and 4 together.
>>>>    2. Remove failed samples
>>>>    3. Remove samples from other projects.
>>>>    4. Normalize using NEQC from limma
>>>>    5. Correct for scan date using COMBAT from sva.
>>>>
>>>> The other option we considered is to omit step 3 (i.e. use other samples
>>>> for normalization and COMBAT) and subset at the end.
>>>>
>>>> I feel this second option allows for better estimation of batch effects
>>>> (especially in chip 4). However, sometimes project A and B can be quite
>>>> different (e.g. samples derived from different tissues) which might mess
>>>> up
>>>> the normalization especially if we want to compare project A to B
>>>> directly. We
>>>> also considered nec() followed by normalizeBetweenArrays with
>>>> "Tquantile"
>>>> but I felt it was too complicated. Anything else to try?
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>>
>>>> Adaikalavan Ramasamy
>>>>
>>>> Senior Leadership Fellow in Bioinformatics
>>>>
>>>> Head of the Transcriptomics Core Facility
>>>>
>>>>
>>>>
>>>> Email: adaikalavan.ramasamy at ndm.ox.ac.uk
>>>>
>>>> Office: 01865 287 710
>>>>
>>>> Mob: 07906 308 465
>>>>
>>>> http://www.jenner.ac.uk/transcriptomics-facility
>>>>
>>>>
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for the
>>> addressee.
>>> You must not disclose, forward, print or use it without the permission of
>>> the sender.
>>> ______________________________________________________________________
>>>
>>>
>>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:10}}