[BioC] normalization and batch correction across multiple project
Adaikalavan Ramasamy
adaikalavan.ramasamy at gmail.com
Tue Sep 9 09:22:49 CEST 2014
Thank you for the explanation.
To clarify, this batch contains 1 sample from this project and 11 samples
from an unrelated project. So I extracted this one sample and added to
others from this project.
On 9 Sep 2014 03:16, "Gordon K Smyth" <smyth at wehi.edu.au> wrote:
> Dear Adai,
>
> The removeBatchEffects() function is written in such a way that it always
> runs without an error. It makes fewer assumptions than Combat, and just
> does the best it can with whatever data you give it.
>
> Nevertheless, I would question the value of having a batch that contains
> only one sample. Better to pool this and another batch.
>
> Best wishes
> Gordon
>
>
> On Mon, 8 Sep 2014, Adaikalavan Ramasamy wrote:
>
> Dear Prof. Smyth,
>>
>> Apologies for the late reply. Thank you so much for valuable input and yes
>> reproducibility of results is paramount. As you indicated, we do sometimes
>> have samples originating from different origins (e.g. PBMC vs. PAXgene) or
>> different countries where sample collection/storage varies which would
>> complicate matters.
>>
>> I tried removeBatchEffect() and it works well. The example project I tried
>> had four batches, one of which consisted of only single sample. How is
>> removeBatchEffects() able to deal with single sample batches when ComBat
>> does not? Here is an example:
>>
>> expr <- matrix( 10 + rnorm(10000*100), nc=100 )
>> batch <- paste0("Batch", rep( c(1, 2, 3, 4), c(33, 33, 33, 1) ) )
>> tissue <- sample(c("A", "B", "C"), 100, replace=T)
>>
>> table(batch, tissue)
>> # tissue
>> # batch A B C
>> # Batch1 11 9 13
>> # Batch2 12 11 10
>> # Batch3 14 9 10
>> # Batch4 0 1 0
>>
>> mod <- model.matrix( ~ tissue) ## I want to preserve variation due to
>> tissue
>>
>>
>> resid <- ComBat(dat=expr, batch=batch, mod=mod)
>> # Found 4 batches
>> # Found 2 categorical covariate(s)
>> # Standardizing Data across genes
>> # Fitting L/S model and finding priors
>> # Error in apply(s.data[, i], 1, var, na.rm = T) :
>> # dim(X) must have a positive length
>>
>>
>> out <- removeBatchEffect( expr, batch=batch, design=mod ) ## works!
>>
>> Should I post this as a separate email? Thank you.
>>
>> Regards, Adai
>>
>>
>>
>> On Thu, Aug 28, 2014 at 2:01 AM, Gordon K Smyth <smyth at wehi.edu.au>
>> wrote:
>>
>> We have had to regularly address the same issues that you are facing.
>>> There is no blanket answer -- every case needs to be considered on its
>>> own
>>> merits -- but you seem to be considering the right options.
>>>
>>> In our work, we generally adjust for the batch in the limma linear model
>>> rather than trying to remove it up-front using combat. Also consider
>>> removeBatchEffect().
>>>
>>> As you say, analysing multiple projects together can help estimate a
>>> batch
>>> effect. However this approach will come unstuck if the samples for the
>>> projects are very different. There is another reason why we generally
>>> avoid analysing multiple projects together. The projects will usually
>>> need
>>> to be submitted eventually to a public repository such as GEO, and the
>>> different projects generally have to be submitted independently. Users
>>> will
>>> not be able to reproduce our normalization and analysis unless the
>>> projects
>>> are analyzed separately.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> On Mon, Aug 18, 2014 at 1:11 PM, Adaikalavan Ramasamy wrote:
>>>
>>>>
>>>> Dear all,
>>>>
>>>> I would like to appeal to the collective wisdom in this group on how
>>>> best
>>>> to solve this problem of normalization and batch correction.
>>>>
>>>> We are a service unit for an academic institute and we run several
>>>> projects simultaneously. We use Illumina HT12-v4 microarrays which can
>>>> take up to 12 different samples per chip. As we QC the data from one
>>>> project, the RNA from failed samples can be repeated to include into
>>>> chips
>>>> from another project (rather than running partial chips to avoid
>>>> wastage).
>>>> Sometimes we include samples from other projects also. Here is a simple
>>>> illustration
>>>>
>>>> Chip No ScanDate Contents
>>>> 1 1st July *12 samples from project A*
>>>> 2 1st July *8 samples from project A* + 4 from
>>>> project B
>>>> 3 1st August 12 samples from Project B
>>>> 4 1st August *1 sample from Project A* + 5 samples
>>>> from B + 6 from project C
>>>> ...
>>>>
>>>> What is the best way to prepare the final data for *project A*? One
>>>> option is to do the following:
>>>>
>>>> 1. Pool chips 1, 2 and 4 together.
>>>> 2. Remove failed samples
>>>> 3. Remove samples from other projects.
>>>> 4. Normalize using NEQC from limma
>>>> 5. Correct for scan date using COMBAT from sva.
>>>>
>>>> The other option we considered is to omit step 3 (i.e. use other samples
>>>> for normalization and COMBAT) and subset at the end.
>>>>
>>>> I feel this second option allows for better estimation of batch effects
>>>> (especially in chip 4). However, sometimes project A and B can be quite
>>>> different (e.g. samples derived from different tissues) which might mess
>>>> up
>>>> the normalization especially if we want to compare project A to B
>>>> directly. We
>>>> also considered nec() followed by normalizeBetweenArrays with
>>>> "Tquantile"
>>>> but I felt it was too complicated. Anything else to try?
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>>
>>>> Adaikalavan Ramasamy
>>>>
>>>> Senior Leadership Fellow in Bioinformatics
>>>>
>>>> Head of the Transcriptomics Core Facility
>>>>
>>>>
>>>>
>>>> Email: adaikalavan.ramasamy at ndm.ox.ac.uk
>>>>
>>>> Office: 01865 287 710
>>>>
>>>> Mob: 07906 308 465
>>>>
>>>> http://www.jenner.ac.uk/transcriptomics-facility
>>>>
>>>>
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for the
>>> addressee.
>>> You must not disclose, forward, print or use it without the permission of
>>> the sender.
>>> ______________________________________________________________________
>>>
>>>
>>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:10}}
More information about the Bioconductor
mailing list