[BioC] Use of RMA in increasingly-sized datasets

Darlene Goldstein Darlene.Goldstein at epfl.ch
Sat Jun 4 18:56:39 CEST 2005


Hi David, BioC list,

Apologies in advance for the length of this email.

I have a few things to add to the advice already given; some might also be
relevant to the thread that Ben Bolstad mentioned in his reply:
http://files.protsuggest.org/biocond/html/1816.html

You asked if anyone has looked at this problem.  I have studied 'subset-based'
RMA strategies, including the extrapolation approach (take e.g. 50 chips and
extrapolate that model to get RMA values for the rest of the chips),
partitioning the entire set of chips into manageable size (however many you can
do in a run, like 50), and doing this partitioning multiple times and averaging
to get RMA values.  The 'partitioning' approaches depend on having an entire set
available.
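
A minimal sketch of the 'multiple partitioning and averaging' idea described above, in Python. It is an illustration only, not code from the study: `partition`, `averaged_values` and `toy_run_batch` are hypothetical names, each chip is reduced to a single number for brevity, and the toy batch function (centering on the batch mean) merely stands in for a real multi-chip RMA run on that batch.

```python
import random
from statistics import mean

def partition(chip_ids, batch_size, rng):
    """Shuffle the chips and cut them into batches of at most batch_size."""
    shuffled = list(chip_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

def averaged_values(chip_ids, batch_size, n_repeats, run_batch, seed=0):
    """Average per-chip results over several random partitions.

    run_batch(batch) must return {chip_id: value} for the chips in the batch.
    In real use it would wrap an RMA run on that batch (not shown here).
    """
    rng = random.Random(seed)
    collected = {chip: [] for chip in chip_ids}
    for _ in range(n_repeats):
        for batch in partition(chip_ids, batch_size, rng):
            for chip, value in run_batch(batch).items():
                collected[chip].append(value)
    return {chip: mean(values) for chip, values in collected.items()}

# Toy data: one raw number per chip (an assumption made to keep this short).
raw = {f"chip{i:02d}": float(i) for i in range(12)}

def toy_run_batch(batch):
    # Stand-in for a multi-chip procedure: the result for a chip depends on
    # which other chips happen to share its batch.
    centre = mean(raw[c] for c in batch)
    return {c: raw[c] - centre for c in batch}

result = averaged_values(list(raw), batch_size=4, n_repeats=20,
                         run_batch=toy_run_batch)
```

Every chip is processed once per partition, so each final value is an average over `n_repeats` different sets of companion chips rather than a single, possibly unrepresentative, one.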

To get an idea of how much RMA values can vary, as well as how inferences might
vary, please see
http://mbi.osu.edu/2004/ws1materials/goldstein.pdf
I have a working ms on this and will be happy to send a preprint when it's
submitted.

You also ask if anyone has a solution.  Unfortunately, I have to say no here (at
least for myself), but I also think that there will not be a general solution.
Rather, the way the issue is approached will depend on the specifics of the
study.  There are many ways to get 1000 chips.  For instance, a lab may process
a bunch of stored samples over a relatively short period of time; alternatively,
the same lab may process samples coming in over a longer period of time, as in a
prospective trial where patients are recruited into the study over time.
Another common possibility is that multiple centers are collaborating on a
larger trial, with each center doing some processing of chips.

There may be different types of problems and artifacts in each of these
scenarios.  For example, the first 50 chips in a study occurring over a period
of time may be qualitatively different from subsequent sets of chips if there is
a time trend for some reason.  In the multi-center case, between-lab variability
is likely to be an important artifact.

Ben made the point that what you need are:

1. A consistent normalization step
2. Probe effects estimates made based on a reasonable number of arrays

I could not agree more with 1; however, in my opinion the problem is how
to get that.  Some people seem to think that quantile normalization of all chips
together will safely remove all artifactual differences between chips.  This is
emphatically _not_ true (and many people are recognizing this).  In an
experiment replicated by the same lab a few months apart (using different
animals each time but following the same protocols in all experimental aspects),
the experimental 'batch' effect persists even if you RMA all chips together. 
This is really easy to see if you just cluster samples based on RMA values - the
major split is between the two replications.  So, if you're hoping to get rid of
this kind of effect merely by RMAing all chips together, I think you are likely
to be disappointed.  I have a preprint of this study if you want more details.
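
A toy illustration of why this can happen, in Python. It is a sketch under stated assumptions, not the analysis from the study: the numbers are invented, `quantile_normalize` is a hypothetical helper working on one value per gene (real RMA normalizes at the probe level), and ties within a chip are assumed absent. Quantile normalization forces every chip to share the same distribution of values, but a gene that sits at a different rank in one batch keeps that difference.

```python
def quantile_normalize(chips):
    """chips: {chip_name: [value per gene]}; returns the same shape.

    Assumes no tied values within a chip, to keep the sketch short.
    """
    names = list(chips)
    n_genes = len(chips[names[0]])
    sorted_cols = [sorted(chips[name]) for name in names]
    # Reference distribution: mean across chips of the k-th smallest value.
    reference = [sum(col[k] for col in sorted_cols) / len(names)
                 for k in range(n_genes)]
    normalized = {}
    for name in names:
        values = chips[name]
        # Gene indices ordered from smallest to largest value on this chip.
        order = sorted(range(n_genes), key=lambda g: values[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized[name] = out
    return normalized

# Invented data: same biology in both batches, but gene 0 reads high in batch B.
chips = {
    "a1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a2": [1.1, 2.1, 3.1, 4.1, 5.1],
    "b1": [6.0, 2.0, 3.0, 4.0, 5.0],
    "b2": [6.1, 2.1, 3.1, 4.1, 5.1],
}
normalized = quantile_normalize(chips)

# After normalization all four chips hold the same set of values, yet gene 0
# is still the lowest gene on the batch A chips and the highest on batch B.
gene0_by_chip = {name: normalized[name][0] for name in normalized}
```

Clustering samples on values like these would still split first by batch, because the normalization equalizes distributions, not individual genes.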

As for 2, I think that the number of arrays is only one component.  The arrays
should also be somehow 'representative'.  In practice, this might be difficult
to achieve.  As you say, a moving target won't be easy to hit (and will cause
confusion as well).

It is not only reasonable but, I would say, necessary for the scientists to
examine early/preliminary results.  What I would do in this case is RMA the
'preliminary' set together if possible and base early analyses on that.  As more
chips come in, most likely I would re-RMA once 'enough' had come in.  However, you
still need to carry out careful exploratory analyses to ensure that you are
really removing the artifacts that you think you are.  What you should look for
depends on the specifics of your study.  Persistent artifacts will need to be
removed by other means (by regression for example).
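
One simple form of 'removal by regression' can be sketched as follows, in Python. This is an assumption about what such a step might look like, not a description of any particular study: `remove_batch_effect` is a hypothetical helper, the numbers are invented, and the approach is only sensible when batch is not confounded with the biological groups of interest. Regressing one gene's values on a batch indicator gives the batch means as fitted values, so the adjustment amounts to subtracting each chip's batch mean and adding back the gene's overall mean.

```python
from statistics import mean

def remove_batch_effect(values, batches):
    """Adjust ONE gene's values for a single batch factor.

    values:  expression of the gene on each chip
    batches: batch label of each chip, in the same order
    """
    grand_mean = mean(values)
    batch_means = {
        b: mean(v for v, label in zip(values, batches) if label == b)
        for b in set(batches)
    }
    # Residual from the batch-only regression, shifted back to the gene's
    # overall level so the adjusted values stay on the original scale.
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]

# Invented values for one gene on four chips from two batches.
gene = [1.55, 1.60, 5.55, 5.50]
labels = ["A", "A", "B", "B"]
adjusted = remove_batch_effect(gene, labels)
```

If the batches also differ in their mix of biological conditions, those conditions would need to be included in the regression as well; otherwise real signal is removed along with the artifact.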

In the event that you are unable to RMA all your chips together, I would
recommend multiple partitioning to get 'final' RMA values for all chips.  This
is in contrast to extrapolating from a single subset.  Yes, the RMA values will
change, which may be confusing and an audit nightmare, but you will give
yourself some protection against 'locking in' an artifact by averaging over
different sets (which are likely to have different artifacts).  I see this as a
major benefit.

Don't hesitate to write back, on or off list, if any of this seems unclear.

Best regards,

Darlene


On Fri, 2005-06-03 at 09:07 +0100, David Kipling wrote:
> Hi
>
> This is not a "how do I process 1000 chips with RMA" but rather
> something slightly different.
>
> We're starting to get projects coming thru our Affy core that involve
> 1000+ chips.   Obviously we can use MAS5 to process the .cel files, and
> irrespective of what happens with subsequent chips in the project the
> expression values from those chips will stay the same because of the
> single-chip nature of the algorithm.
>
> It would be nice to run, in parallel, RMA-style processing of the data.
>   The issue this raises for me relates to the desire of the scientists
> to look at their data before the end of the project (e.g. you'd want to
> explore the first 200 cancer samples rather than wait for all 1000 to
> be done), which is understandable.   My concern is that the multi-chip
> nature of RMA means that, for any specific .cel file, the expression
> values will depend on the other chips included in the run, and so the
> expression values from that .cel file will be different in the early
> stages (200 chips) and at the end (1000 chips).  Such a 'moving target'
> dataset may be confusing and would certainly cause an audit headache.
>
> Has anyone explored this issue and proposed a solution?   It's entirely
> possible that I am being totally paranoid and that after 100+ chips in
> a dataset the expression values plateau out and are stable in the face
> of additional .cel files being included;   I don't yet have access to
> big-enough datasets to critically address that.  I do have some
> recollection in the deep mists of time a comment (?from Ben Bolstad?)
> suggesting the use of a standard 'training set' of (say) 50 chips, to
> which you would add your new chips one at a time and process.
>
> All comments, thoughts, or experiences gratefully received!
>
> Regards
>
> David
>
>
>
> Prof David Kipling
> Department of Pathology
> School of Medicine
> Cardiff University
> Heath Park
> Cardiff CF14 4XN
>
> Tel:  029 2074 4847
> Email:  KiplingD at cardiff.ac.uk
>
-- 
Darlene Goldstein
École Polytechnique Fédérale de Lausanne (EPFL)
Institut de Mathématiques
Batiment MA, Station 8        Tel: +41 21 693 2552
CH-1015 Lausanne              Fax: +41 21 693 4303
SWITZERLAND


