[BioC] Use of RMA in increasingly-sized datasets

Fri Jun 3 10:07:46 CEST 2005

Hi

This is not a "how do I process 1000 chips with RMA" question, but 
rather something slightly different.

We're starting to get projects coming through our Affy core that involve 
1000+ chips. Obviously we can use MAS5 to process the .cel files, and 
irrespective of what happens with subsequent chips in the project, the 
expression values from the chips already processed will stay the same 
because of the single-chip nature of the algorithm.

It would be nice to run, in parallel, RMA-style processing of the data. 
The issue this raises for me relates to the desire of the scientists 
to look at their data before the end of the project (e.g. you'd want to 
explore the first 200 cancer samples rather than wait for all 1000 to 
be done), which is understandable.   My concern is that the multi-chip 
nature of RMA means that, for any specific .cel file, the expression 
values will depend on the other chips included in the run, and so the 
expression values from that .cel file will be different in the early 
stages (200 chips) and at the end (1000 chips).  Such a 'moving target' 
dataset may be confusing and would certainly cause an audit headache.
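
To make the concern concrete, here is a minimal sketch in plain Python 
of quantile normalization, which is one of the multi-chip steps in RMA 
(the probe-set summarization step is also fitted across chips). It is a 
toy illustration, not the Bioconductor implementation: the function 
name, the three-probe "chips" and their intensities are all made up, 
and ties are not handled.

```python
def quantile_normalize(chips):
    """Toy quantile normalization of equal-length lists of probe intensities.

    Illustrative only: no tie handling, no background correction.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean intensity across chips at each rank.
    reference = [
        sum(s[rank] for s in sorted_chips) / len(chips)
        for rank in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three chips with three probes each.
chip_a = [2.0, 4.0, 6.0]
chip_b = [3.0, 5.0, 10.0]
chip_c = [1.0, 9.0, 20.0]

early = quantile_normalize([chip_a, chip_b])[0]         # chip_a in a 2-chip run
late = quantile_normalize([chip_a, chip_b, chip_c])[0]  # chip_a in a 3-chip run

print(early)  # [2.5, 4.5, 8.0]
print(late)   # [2.0, 6.0, 12.0]
```

The same chip gets different normalized values in the two runs because 
the reference distribution is recomputed from whichever chips are 
included. Whether the shift becomes negligible at 100+ chips is exactly 
the open question; this toy example says nothing about that.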

Has anyone explored this issue and proposed a solution? It's entirely 
possible that I am being totally paranoid and that after 100+ chips in 
a dataset the expression values plateau and are stable in the face of 
additional .cel files being included; I don't yet have access to 
big-enough datasets to critically address that. I do have some 
recollection, from the deep mists of time, of a comment (?from Ben 
Bolstad?) suggesting the use of a standard 'training set' of (say) 50 
chips, to which you would add your new chips one at a time and process.
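
As I understand the 'training set' idea, it could be sketched roughly 
as below. This is my own toy reading of it in plain Python, covering 
only the quantile normalization step and not the rest of RMA; the 
function names, the two-chip training set and the intensities are 
hypothetical, and it is not a description of any existing Bioconductor 
function.

```python
def quantile_normalize(chips):
    """Toy quantile normalization (no tie handling); illustrative only."""
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean intensity across chips at each rank.
    reference = [
        sum(s[rank] for s in sorted_chips) / len(chips)
        for rank in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


def process_with_training_set(training_set, new_chip):
    """Normalize one new chip together with a fixed training set.

    The result depends only on the training set and the chip itself,
    not on any other chips in the project.
    """
    return quantile_normalize(training_set + [new_chip])[-1]


# Hypothetical fixed training set (two chips here; the post suggests ~50).
training_set = [[2.0, 4.0, 6.0], [3.0, 5.0, 10.0]]

# Project chips arriving over time (made-up intensities).
project_early = [[9.0, 1.0, 20.0]]
project_late = project_early + [[30.0, 40.0, 50.0]]

early = [process_with_training_set(training_set, c) for c in project_early]
late = [process_with_training_set(training_set, c) for c in project_late]

print(early[0])  # [6.0, 2.0, 12.0]
print(late[0])   # [6.0, 2.0, 12.0], unchanged by the later chip
```

Processed this way, each chip's values are fixed as soon as it is run, 
which would address the audit problem, at the cost of depending on how 
well the training set represents the chips that come later.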

All comments, thoughts, or experiences gratefully received!

Regards

David

Prof David Kipling
Department of Pathology
School of Medicine
Cardiff University
Heath Park
Cardiff CF14 4XN

Tel:  029 2074 4847
Email:  KiplingD at cardiff.ac.uk