[BioC] Best way to normalize GEO gene expression datasets from different labs/sources?

Thu Feb 16 20:57:18 CET 2012

Ying,

1. For multiple arrays, you have two options. Both use RMA background
correction and quantile normalization to a fixed reference distribution.
The difference is in the summarization. The default summarization will
treat each array individually -- subtracting the frozen "global"
probe-effect and down-weighting probes that show high between- or
within-batch residual variance. Alternatively, if you know the batches
present in your data, you can preprocess each batch separately using
the random_effect summarization. This will allow a batch-specific
change in the global probe-effect (the random effect in the model) for
each batch in your data set. Often the two methods will give you very
similar results.
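
The two options above might be sketched in R roughly as follows. This is
only an illustration, not a definitive recipe: the function names and the
cel_files / batch_labels arguments are made up for the example, and it
assumes the affy, frma and Biobase packages plus the matching frmavecs data
package are installed. It also assumes all arrays are on the same platform,
so the probesets come back in the same order for every batch.

```r
# Sketch only. Assumes affy, frma, Biobase and the matching frmavecs data
# package (e.g. hgu133plus2frmavecs) are installed. Names are illustrative.

# Option A: default summarization; each array is treated individually.
frma_default <- function(cel_files) {
  abatch <- affy::ReadAffy(filenames = cel_files)
  frma::frma(abatch)
}

# Option B: one frma() call per known batch, using random_effect summarization.
frma_by_batch <- function(cel_files, batch_labels) {
  stopifnot(length(cel_files) == length(batch_labels))
  files_by_batch <- split(cel_files, batch_labels)
  expr_list <- lapply(files_by_batch, function(files) {
    abatch <- affy::ReadAffy(filenames = files)
    Biobase::exprs(frma::frma(abatch, summarize = "random_effect"))
  })
  # Assumption: same platform, so rows (probesets) line up across batches.
  do.call(cbind, unname(expr_list))
}
```

Comparing the output of the two functions on the same files is one way to
check how similar the results are for a given data set.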

2. Yes, this option is still available. The frma function uses the
frozen parameter vectors that correspond to the cdfname of your
AffyBatch object. So if you read in the CEL file data with an
alternative CDF, frma will attempt to load the corresponding frmavecs
data package.
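
As a rough sketch of that workflow: the cdfname value below
("HGU133Plus2_Hs_ENTREZG") is an assumed name for an Entrez Gene custom CDF
and is used purely as an example, as are the function names. The
corresponding CDF package and frmavecs data package would both need to be
installed for this to run.

```r
# Sketch only. The default cdf value is an assumed example of an alternative
# CDF name; substitute the one you actually use. Function names are illustrative.

read_alt_cdf <- function(cel_files, cdf = "HGU133Plus2_Hs_ENTREZG") {
  # ReadAffy's cdfname argument overrides the CDF recorded in the CEL files.
  affy::ReadAffy(filenames = cel_files, cdfname = cdf)
}

frma_alt_cdf <- function(cel_files, cdf = "HGU133Plus2_Hs_ENTREZG") {
  abatch <- read_alt_cdf(cel_files, cdf)
  # Report the cdfname that frma will use to look up its frozen vectors.
  message("cdfname in use: ", affy::cdfName(abatch))
  frma::frma(abatch)
}
```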

3. You need to install the data package you would like to use yourself, via
biocLite; it is not installed along with frma.

4. I don't believe so. That said, I definitely think it is worthwhile to
examine the preprocessed data for batch effects. fRMA is designed to
address a very specific type of batch effect -- changes in probe
behavior between batches. Batch effects certainly manifest themselves in
other ways, which methods such as SVA are designed to address.
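
A minimal sketch of running SVA on fRMA-preprocessed data could look like
the following. It assumes the sva package is installed, that expr is a
probesets-by-samples matrix (for example exprs() of the frma output), and
that group is a hypothetical factor holding the biological variable of
interest; the function name is illustrative as well.

```r
# Sketch only. Assumes the sva package is installed. 'expr' is a
# probesets x samples matrix; 'group' is a hypothetical factor of interest.

run_sva <- function(expr, group) {
  stopifnot(is.matrix(expr), ncol(expr) == length(group))
  pheno <- data.frame(group = group)
  mod  <- model.matrix(~ group, data = pheno)  # full model: variable of interest
  mod0 <- model.matrix(~ 1, data = pheno)      # null model: intercept only
  n_sv <- sva::num.sv(expr, mod, method = "be")  # estimated number of surrogate variables
  sva::sva(expr, mod, mod0, n.sv = n_sv)
}
```

The estimated surrogate variables (the sv element of the result) can then be
inspected, or included as covariates in the downstream model.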

Hope this helps.

Best,
Matt

On Thu, Feb 16, 2012 at 2:42 PM, ying chen <ying_chen at live.com> wrote:
> Hi Matt,
>
> Thanks a lot for the suggestion.
>
> I read the papers and think frma is perfect for my task. But I still have a
> few questions:
>
> 1) For multiple arrays, the only summarization method is random_effect,
> right?
>
> 2) In your frmaTools paper (BMC Bioinformatics) you mentioned that the
> latest version of the frma package has the option to use the version 13
> Entrez Gene probe annotation (section 3.2 Alternative CDF). But I could not
> find any method to apply this option in manual frma.pdf (Feb 14,
> 2012) downloaded from Bioconductor frma page. Is this option still
> available?
>
> 3) Is the data file installed automatically when I install the frma package, or
> do I need to install it myself, e.g. biocLite("hgu133plus2frmavecs")?
>
> 4) When you built the reference distribution for U133Plus2, did you pay
> attention to the experiment protocol used for each sample, such as the
> starting RNA type (total RNA or mRNA), the amount of total RNA used (~5ug or
> 10-100ng)? Does it make sense to run SVA after frma to correct for the
> possible batch effects due to different protocols used?
>
> Thanks,
>
> Ying
>
>> Date: Tue, 14 Feb 2012 17:17:48 -0500
>> Subject: Re: [BioC] Best way to normalize GEO gene expression datasets
>> from different labs/sources?
>> From: mccallm at gmail.com
>> To: ying_chen at live.com
>> CC: bioconductor at r-project.org
>
>>
>> Ying,
>>
>> You might consider fRMA:
>> McCall MN, Bolstad BM, and Irizarry RA* (2010). Frozen Robust
>> Multi-Array Analysis (fRMA), Biostatistics, 11(2):242-253.
>> http://bioconductor.org/packages/release/bioc/html/frma.html
>>
>> This preprocessing algorithm was designed to handle such multi-batch
>> analyses.
>>
>> Best,
>> Matt
>>
>> On Tue, Feb 14, 2012 at 4:49 PM, ying chen <ying_chen at live.com> wrote:
>> >
>> >
>> > Hi, I collected dozens of breast cancer GEO datasets (same platform,
>> > Affy U133Plus2) and wonder if there is a way to normalize these datasets so
>> > I can compare the gene expression levels across all the datasets even though
>> > they are from different labs? I am thinking about either running RMA on all
>> > the datasets together, followed by SVA to correct for batch effects, or
>> > running RMA dataset by dataset, followed by mean-scaling. Does either of
>> > these make sense? Or what is the best approach? Any suggestions? Thanks a
>> > lot for the help! Ying
>> >
>>
>>
>>
>> --
>> Matthew N McCall, PhD
>> 112 Arvine Heights
>> Rochester, NY 14611
>> Cell: 202-222-5880
