[BioC] split arrays

Robert Gentleman rgentlem at fhcrc.org
Thu Sep 29 18:27:05 CEST 2005

  I am not sure what you are really asking, but here goes.
References and corresponding R/Bioconductor packages are listed below.

   In my opinion, separate normalization and expression estimation are 
essential for different experiments (by experiment I mean a 
collection of identical arrays processed at about the same time by about 
the same people using about the same protocol, and by identical arrays I 
mean arrays from the same batch). While one can often do fancy things to 
align different arrays prior to processing them, it does not seem like a 
good idea at all. When it works, so would separate normalization, and 
when it does not work you won't know.

  After you have normalized and estimated expression values, you then 
have the gene matching problem. This is not trivial, and there are papers 
that discuss it (Parmigiani et al). There are also issues 
regarding whether you want to make inference at the gene level or the 
sequence level (UniGene is not the same as Entrez Gene). While many have 
ignored the issues that arise (even on a single chip) when the same 
gene has been probed via several different methods, that does not seem 
to be best practice.

  If you have no common genes, then life is somewhat easier: you just 
have a bunch more features, and the suggestion to simply use rbind seems 
pretty sensible to me. There are some potential pitfalls, though, and 
you might want to do some checking to ensure that one set of features is 
not dominating the other for reasons that are not biological.
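
  As a rough illustration of that kind of check (not limma code, and 
not anyone's published method), here is a minimal Python sketch. It 
assumes each normalized set is held as a table mapping feature ID to 
one value per sample, with the same samples in both sets. The function 
names row_bind and spread_summary and the toy numbers are hypothetical. 
It stacks the two tables by feature and compares the centre and spread 
of each set, so that a set with a much larger spread can be looked at 
before it swamps downstream analyses.

```python
from statistics import median


def row_bind(set_a, set_b):
    """Stack two {feature_id: [value per sample]} tables by feature.

    Assumes both tables were measured on the same samples, in the same order.
    """
    overlap = set(set_a) & set(set_b)
    if overlap:
        raise ValueError(f"feature IDs present in both sets: {sorted(overlap)}")
    lengths = {len(row) for row in list(set_a.values()) + list(set_b.values())}
    if len(lengths) != 1:
        raise ValueError("every feature needs one value per sample")
    combined = dict(set_a)
    combined.update(set_b)
    return combined


def spread_summary(table):
    """Return (median, median absolute deviation) over all values in a table."""
    values = [x for row in table.values() for x in row]
    centre = median(values)
    mad = median(abs(x - centre) for x in values)
    return centre, mad


# Toy log-ratios for three samples; set B is deliberately noisier than set A.
slides_a = {"A1": [0.1, -0.2, 0.3], "A2": [-0.1, 0.0, 0.2]}
slides_b = {"B1": [1.5, -2.0, 2.5], "B2": [-1.0, 0.5, 2.0]}

combined = row_bind(slides_a, slides_b)
centre_a, mad_a = spread_summary(slides_a)
centre_b, mad_b = spread_summary(slides_b)
print(len(combined), "features;", "MAD A:", mad_a, "MAD B:", mad_b)
```

A large gap between the two spreads does not by itself say which set is 
wrong; it only flags that the difference may be technical rather than 
biological and deserves a closer look.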

  If you do have genes in common, then life is harder: the models are 
more complicated, and IMHO you want to spend a few hours with a local 
statistician sorting out what questions you want to ask. Essentially, 
considering what the right model is on a per-gene basis is a pretty 
good starting point. As I said, there are some papers (Choi et al, 
Gentleman et al); sometimes they come under the heading of 
meta-analysis, and other times simply random effects models. For the 
more statistically inclined I recommend the book by Cox and Solomon, 
which directly addresses issues regarding combining microarray experiments.
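
  To make the per-gene random effects idea concrete, here is a minimal 
Python sketch for a single gene. It assumes you already have, for each 
experiment, an effect estimate for that gene and the variance of that 
estimate. It uses the DerSimonian-Laird moment estimator for the 
between-experiment variance, which is one common choice and is an 
assumption here, not a description of what any particular package does. 
The function name random_effects_combine is hypothetical.

```python
import math


def random_effects_combine(effects, variances):
    """Combine per-experiment effect estimates for one gene.

    effects   : effect estimate from each experiment
    variances : within-experiment variance of each estimate
    Returns (combined effect, its standard error, between-experiment variance).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need matching effects and variances from >= 2 experiments")

    # Fixed-effects (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q measures heterogeneity around the fixed-effects estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # DerSimonian-Laird moment estimate of between-experiment variance,
    # truncated at zero.
    denom = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / denom)

    # Random-effects weights add tau2 to each within-experiment variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    combined = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    std_err = math.sqrt(1.0 / sum_w_star)
    return combined, std_err, tau2


# Two experiments that disagree: the between-experiment variance is non-zero
# and the standard error widens accordingly.
mu, se, tau2 = random_effects_combine([0.0, 2.0], [0.1, 0.1])
print(mu, se, tau2)
```

When the experiments agree closely, the estimated between-experiment 
variance is truncated to zero and the result reduces to the usual 
inverse-variance weighted average.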

  Best wishes,

G. Parmigiani, E. Garrett-Mayer, R. Anbazhagan, et al. A cross-study 
comparison of gene expression studies for the molecular classification 
of lung cancer. Clinical Cancer Research, 10:2922–2927, 2004.
R package: MergeMaid

J. K. Choi, U. Yu, S. Kim, et al. Combining multiple microarray studies 
and modeling interstudy variation. Bioinformatics, 19, Suppl. 1:i84–i90, 
2003.
R package: GeneMeta

D.R. Cox and P. J. Solomon. Components of Variance. Chapman and Hall, 
New York, 2003.

R. Gentleman, M. Ruschhaupt and W. Huber. On the Synthesis of 
Microarray Experiments.
R package: GeneMetaEx

scholz at Ag.arizona.edu wrote:
> Adrien,
> Thanks for this response. Unfortunately, there are no oligos in common between
> the two arrays. If anyone else has a response to my question (below), I'd like
> to hear it.
> Matt
> Matt,
> I am not familiar with the maize arrays, but I am using the following
> procedure for Affymetrix moe430 split arrays, which have ~160 probesets
> in common between A and B:
> 1) background-correct each chip separately at probe-level
> 2) get a measure of expression at probeset-level
> 3) plot the common probesets against each other for each pair of each
> chips. If you observe the same thing as me, you will see that the trend
> is linear but with intercept != 0 and slope != 1. 
> 4) scale the B chip with those estimated intercept and slope
> Steps 1 and 2 are easily done with rma( , normalize=F).
> Wolfgang Huber and I are currently writing a little package which does
> steps 3 and 4 automatically.
> I'm not sure whether this procedure could make sense or be adapted
> somehow to your maize arrays (do they have enough probes in common?),
> but anyway, some food for thoughts...
> Adrien
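
Steps 3 and 4 of the procedure quoted above can be sketched in a few 
lines. This is a minimal Python illustration, not the package Adrien 
mentions. It assumes ordinary least squares on the common probesets, 
with the B-chip value regressed on the A-chip value, and then maps 
every B-chip value back onto the A-chip scale. The direction of the fit, 
the function names fit_line and rescale_to_a, and the toy numbers are 
all assumptions.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rescale_to_a(values_b, intercept, slope):
    """Invert b = intercept + slope * a to put B-chip values on the A scale."""
    return [(b - intercept) / slope for b in values_b]


# Toy expression values for the probesets present on both chips.
common_a = [1.0, 2.0, 3.0, 4.0]
common_b = [2.5, 4.5, 6.5, 8.5]   # here b = 0.5 + 2 * a exactly

intercept, slope = fit_line(common_a, common_b)
rescaled_b = rescale_to_a(common_b, intercept, slope)
print(intercept, slope, rescaled_b)
```

In practice the fit would be done per pair of chips, as in step 3, and 
the scatter around the line is worth plotting before trusting the 
rescaling.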
>>Recently you advised someone with a split set of maize arrays 
>>that they could do their analysis by reading all the A slides 
>>into an RGList and normalizing, then doing the same with the 
>>B slides, and then combining the two datasets via
>>rbind() of the two MAList objects. I have a similar (the 
>>same?) set of arrays and some of the users of these arrays 
>>have noted that the A and B slides perform differently, i.e. 
>>more background on the B slide, for whatever reason. Though 
>>I'm not actually convinced this is true, it makes me wonder 
>>whether the two datasets should be combined at all since 
>>there may be a "between array set"
>>source of variation. Am I right to segregate these sets or is 
>>there some overwhelming benefit to combining them? I'm no 
>>statistician and would appreciate your take.
> Matt

Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
rgentlem at fhcrc.org
