[BioC] split arrays

Robert Gentleman rgentlem at fhcrc.org
Thu Sep 29 20:20:43 CEST 2005

  If they have no probes in common, and were applied to the same RNA 
(essentially technical and not biological replicates), then the two 
arrays can be combined into essentially one big matrix. I would do some 
careful study to make sure that there are no major differences between 
the two (for example, look at the distribution of expression, the 
within-gene variance across samples, etc.). My approach is generally to 
ask what things should be the same, and then to compare them. If there 
are big differences then you need to figure out how to address them, but 
if not then you can just treat it as if you had measured all the 
features on the mRNA samples; which type of array was used is 
irrelevant.
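
  As a rough illustration of that checking step, here is a minimal 
sketch in base R. The matrices exprsA and exprsB are simulated 
stand-ins (hypothetical names) for the two separately normalized 
expression matrices; the particular summaries are only one possible 
choice of "things that should be the same".

```r
# Simulated stand-ins for two separately normalized expression matrices
# (rows = features, columns = the same RNA samples in the same order).
set.seed(1)
samples <- paste0("S", 1:6)
exprsA <- matrix(rnorm(600, mean = 8, sd = 1.5), nrow = 100,
                 dimnames = list(paste0("A_", 1:100), samples))
exprsB <- matrix(rnorm(480, mean = 8, sd = 1.5), nrow = 80,
                 dimnames = list(paste0("B_", 1:80), samples))

# Structural checks: same samples in the same order, no shared features.
stopifnot(identical(colnames(exprsA), colnames(exprsB)))
stopifnot(length(intersect(rownames(exprsA), rownames(exprsB))) == 0)

# Compare the overall distribution of expression on each array type.
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
quantileSummary <- rbind(A = quantile(exprsA, probs),
                         B = quantile(exprsB, probs))

# Compare the within-gene variance across samples.
varA <- apply(exprsA, 1, var)
varB <- apply(exprsB, 1, var)
varianceSummary <- rbind(A = quantile(varA, probs),
                         B = quantile(varB, probs))

# If nothing looks badly off, stack the features into one big matrix.
combined <- rbind(exprsA, exprsB)
```

  Printing quantileSummary and varianceSummary side by side (or drawing 
boxplots of the same quantities) is usually enough to spot a gross 
difference between the two array types.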

  I'm not sure I am following the separate linear models part. Most of 
what anyone does is gene-at-a-time (you could look at the Category 
package for an alternative), and so you would fit separate linear models 
to genes within arrays and the same between arrays.
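
  To make the gene-at-a-time point concrete, here is a small sketch in 
base R, assuming a simple two-group design. The data are simulated and 
the helper names (fitOneGene, fitAllGenes) are made up for the 
illustration. With plain per-gene lm() fits, modeling the stacked 
matrix gives the same per-gene results as modeling each array type 
separately.

```r
set.seed(2)
group <- factor(rep(c("control", "treated"), each = 3))
samples <- paste0("S", 1:6)
exprsA <- matrix(rnorm(60, mean = 8), nrow = 10,
                 dimnames = list(paste0("A_", 1:10), samples))
exprsB <- matrix(rnorm(48, mean = 8), nrow = 8,
                 dimnames = list(paste0("B_", 1:8), samples))

# Fit one linear model to one gene and keep the treatment effect,
# its standard error, and its p-value.
fitOneGene <- function(y, group) {
  fit <- lm(y ~ group)
  coef(summary(fit))["grouptreated",
                     c("Estimate", "Std. Error", "Pr(>|t|)")]
}

# Apply the per-gene fit to every row of an expression matrix.
fitAllGenes <- function(exprs, group) {
  t(apply(exprs, 1, fitOneGene, group = group))
}

perGeneA <- fitAllGenes(exprsA, group)
perGeneB <- fitAllGenes(exprsB, group)
perGeneCombined <- fitAllGenes(rbind(exprsA, exprsB), group)

# Each row is fitted independently, so combining changes nothing here.
sameFits <- isTRUE(all.equal(perGeneCombined, rbind(perGeneA, perGeneB)))
```

  This equivalence holds only because each row is fitted on its own. 
Methods that borrow information across genes (limma's empirical Bayes 
moderation, for instance) are not row-independent in this way, so there 
the choice between combining and separating can matter.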

  When you have duplicate probes from what are essentially different 
experiments, then I believe you need to think about a random effects model.
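
  For a rough idea of what such a model looks like for a single gene, 
here is a sketch that pools per-experiment effect estimates using the 
DerSimonian-Laird moment estimator of the between-experiment variance. 
That estimator is just one common choice, picked here for illustration; 
the function name randomEffects and the input numbers are made up.

```r
# Pool effect estimates for ONE gene measured in several experiments.
# effect:   per-experiment effect estimates (e.g. log fold changes)
# variance: their estimated sampling variances
randomEffects <- function(effect, variance) {
  k <- length(effect)
  w <- 1 / variance                          # fixed-effects weights
  fixed <- sum(w * effect) / sum(w)          # fixed-effects pooled estimate
  Q <- sum(w * (effect - fixed)^2)           # heterogeneity statistic
  # Between-experiment variance, truncated at zero.
  tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))
  wStar <- 1 / (variance + tau2)             # random-effects weights
  c(pooled = sum(wStar * effect) / sum(wStar),
    se = sqrt(1 / sum(wStar)),
    tau2 = tau2)
}

# Made-up example: the same gene probed in two experiments.
result <- randomEffects(effect = c(1.0, 2.0), variance = c(0.1, 0.1))
```

  With only two experiments the between-experiment variance is 
estimated very poorly, which is one more reason to talk the model over 
with a statistician before relying on it.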

  Best wishes,

scholz at Ag.arizona.edu wrote:
> Thanks, Robert. If I am understanding you correctly, you would advocate both
> separate normalization AND separate linear modeling in the case where the two
> arrays come from different batches and have no common probeset, correct? If I
> was reading Gordon's reply to the other gentleman's email correctly, he was
> suggesting separate normalization but not separate linear modeling for the
> datasets. My question, which in retrospect was unclear, was about what the
> advantages/disadvantages were to combining/separating the datasets for linear
> modeling.
> Matt 
>>  I am not sure what you are really asking but here goes.
>>References and corresponding R/Bioconductor packages are listed below.
>>   In my opinion separate normalization and expression estimation is 
>>essential for different experiments (and by experiment I mean a 
>>collection of identical arrays processed at about the same time by about 
>>the same people using about the same protocol; and by identical arrays I 
>>mean from the same batch). While one can often do fancy things to align 
>>different arrays prior to processing them it does not seem like a good 
>>idea at all. When it works, so would separate normalization and when it 
>>does not work you won't know.
>>  After you have normalized and estimated expression values then you 
>>have the gene matching problem. This is not trivial; there are papers 
>>around that discuss this (Parmigiani et al). There are some issues 
>>regarding whether you want to make inference at the gene level or the 
>>sequence level (Unigene is not the same as Entrez Gene). While many have 
>>ignored the issues that arise (even on a single chip) where the same 
>>gene has been probed via several different methods, that does not seem 
>>to be a "best practice".
>>  If you have no common genes, then life is somewhat easier, you just 
>>have a bunch more features, and the suggestion to simply use rbind seems 
>>pretty sensible to me, although there are some potential pitfalls and 
>>you might want to do some checking to ensure that one set of features is 
>>not dominating the other for reasons that are not biological.
>>  If you do have genes in common, then life is harder, the models are 
>>more complicated and IMHO you want to spend a few hours with a local 
>>statistician sorting out what questions you want to ask. Essentially, 
>>considering what the right model is on a per-gene basis is a pretty 
>>good starting point. As I said there are some papers (Choi et al, 
>>Gentleman et al), sometimes they come under the heading of 
>>meta-analysis, and other times simply random effects models. For the 
>>more statistically inclined I recommend the book by Solomon and Cox 
>>which directly addresses issues regarding combining microarray experiments.
>>  Best wishes,
>>    Robert
>>G. Parmigiani, E. Garrett-Mayer, R. Anbazhagan, et al. A cross-study 
>>comparison of gene expression studies for the molecular classification 
>>of lung cancer. Clinical Cancer Research, 10:2922–2927, 2004.
>>R package: MergeMaid
>>J. K. Choi, U. Yu, S. Kim, et al. Combining multiple microarray studies 
>>and modeling interstudy variation. Bioinformatics, 19, Suppl. 1:i84–i90, 
>>2003.
>>R package: GeneMeta
>>D. R. Cox and P. J. Solomon. Components of Variance. Chapman and Hall, 
>>New York, 2003.
>>R. Gentleman, M. Ruschhaupt and W. Huber. On the Synthesis of 
>>Microarray Experiments.
>>R package: GeneMetaEx
>>scholz at Ag.arizona.edu wrote:
>>>Thanks for this response. Unfortunately, there are no oligos in common between
>>>the two arrays. If anyone else has a response to my question (below), I'd like
>>>to hear it.
>>>I am not familiar with the maize arrays, but I am using the following
>>>procedure for Affymetrix moe430 split arrays, which have ~160 probesets
>>>in common between A and B:
>>>1) background-correct each chip separately at probe-level
>>>2) get a measure of expression at probeset-level
>>>3) plot the common probesets against each other for each pair of
>>>chips. If you observe the same thing as I do, you will see that the trend
>>>is linear but with intercept != 0 and slope != 1. 
>>>4) scale the B chip with those estimated intercept and slope
>>>Steps 1 and 2 are easily done with rma( , normalize=F).
>>>Wolfgang Huber and I are currently writing a little package which does
>>>steps 3 and 4 automatically.
>>>I'm not sure whether this procedure could make sense or be adapted
>>>somehow to your maize arrays (do they have enough probes in common?),
>>>but anyway, some food for thought...
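
Steps 3 and 4 of the procedure quoted above might look roughly like the 
following in base R. This is only a sketch on simulated probeset-level 
values, not the package mentioned in the message. The names chipA, 
chipB and chipBScaled are hypothetical, and the direction of the fit (A 
regressed on B, by ordinary least squares) is an assumption, since the 
message does not say how the intercept and slope are estimated.

```r
set.seed(3)
# Simulated probeset-level expression for one A chip and its matching
# B chip; 160 probesets are present on both.
chipA <- setNames(rnorm(500, mean = 8, sd = 2), paste0("ps", 1:500))
commonIds <- paste0("ps", 1:160)
chipB <- c(setNames(0.5 + 0.9 * chipA[commonIds] + rnorm(160, sd = 0.1),
                    commonIds),
           setNames(rnorm(340, mean = 7, sd = 2), paste0("psB", 1:340)))

# Step 3: relate the common probesets on the two chips.
common <- intersect(names(chipA), names(chipB))
fit <- lm(a ~ b, data = data.frame(a = chipA[common], b = chipB[common]))
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["b"]]

# Step 4: put every B probeset on the A scale using that line.
chipBScaled <- intercept + slope * chipB
```

Plotting chipA[common] against chipBScaled[common] afterwards should 
give points scattered around the identity line if the linear trend is a 
reasonable description.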
>>>>Recently you advised someone with a split set of maize arrays 
>>>>that they could do their analysis by reading all the A slides 
>>>>into an RGList and normalizing, then doing the same with the 
>>>>B slides, and then combining the two datasets via
>>>>rbind() of the two MAList objects. I have a similar (the 
>>>>same?) set of arrays and some of the users of these arrays 
>>>>have noted that the A and B slides perform differently, i.e. 
>>>>more background on the B slide, for whatever reason. Though 
>>>>I'm not actually convinced this is true, it makes me wonder 
>>>>whether the two datasets should be combined at all since 
>>>>there may be a "between array set"
>>>>source of variation. Am I right to segregate these sets or is 
>>>>there some overwhelming benefit to combining them? I'm no 
>>>>statistician and would appreciate your take.
>>>College of Agriculture and Life Sciences Web Mail.
>>>Bioconductor mailing list
>>>Bioconductor at stat.math.ethz.ch
>>Robert Gentleman, PhD
>>Program in Computational Biology
>>Division of Public Health Sciences
>>Fred Hutchinson Cancer Research Center
>>1100 Fairview Ave. N, M2-B876
>>PO Box 19024
>>Seattle, Washington 98109-1024
>>rgentlem at fhcrc.org

