[BioC] RMA question

Naomi Altman naomi at stat.psu.edu
Mon Dec 18 00:35:57 CET 2006

I would say that it depends on how you plan to use the classification function.

If, in future, you will collect more samples, and use the 
classification function to classify them, then you need to normalize 
the test set the same way you will normalize the new arrays.
How you plan to do this may also affect how you normalize the training set.
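
As a rough illustration of normalizing new arrays the same way as the training set, here is a minimal Python sketch of one possible approach: freeze a reference distribution from the training arrays, then map each array onto it by rank. This is only the quantile-normalization idea in simplified form, not RMA itself (no background correction or probe-set summarization). The function names reference_quantiles and normalize_to_reference and the toy intensities are made up for illustration, and every array is assumed to have the same number of probes.

```python
from statistics import mean


def reference_quantiles(training_arrays):
    """Average the sorted intensities of the training arrays, rank by rank.

    Assumption: all arrays have the same length (same probes on each chip).
    """
    sorted_arrays = [sorted(a) for a in training_arrays]
    return [mean(values) for values in zip(*sorted_arrays)]


def normalize_to_reference(array, reference):
    """Replace each intensity by the reference value of the same rank.

    Ties are broken by position in the array (a simplification).
    """
    order = sorted(range(len(array)), key=lambda i: array[i])
    normalized = [0.0] * len(array)
    for rank, index in enumerate(order):
        normalized[index] = reference[rank]
    return normalized


# Toy data: two "training arrays" and one "new array", three probes each.
training = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
reference = reference_quantiles(training)            # [1.5, 3.5, 5.5]

# Training arrays and the new array go through the same frozen mapping.
training_normalized = [normalize_to_reference(a, reference) for a in training]
new_normalized = normalize_to_reference([10.0, 30.0, 20.0], reference)
print(new_normalized)                                 # [1.5, 5.5, 3.5]
```

Because the reference is fixed once from the training arrays, a new array can be normalized on its own without altering the training data.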


At 02:53 PM 12/17/2006, Wolfgang Huber wrote:
>Hi James,
>this is a general problem of normalization methods that work by adapting
>arrays in a set to themselves, and not to an independent reference.
>Option 1 is indeed discredited when you want to get a fair estimate of
>classification rates, since it does not faithfully simulate the real
>application where you want to classify a new sample.
>Option 2 does not work since f contains for each array a number of
>array-specific, idiosyncratic parameters that reflect hybridization
>conditions, labeling efficiency, RNA extraction etc. You cannot "learn"
>them in advance.
>The option I'd take is to look for a normalization method that
>normalizes each new array individually (or in sets appropriate to your
>intended application) to an existing database of reference arrays. I
>know that various people on this list have been/are working on such
>methods. But I am probably not up-to-date myself - maybe someone can
>   Best wishes
>   Wolfgang
>Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
> > Hi, I have a question about RMA normalization. Since RMA is an
> > across-sample normalization, suppose I have 50 training samples (cel
> > files) and 50 test samples (cel files). There are two ways to perform
> > normalization:
> > 1. Combine all 100 samples and use RMA to do the normalization. Then
> > train on the 50 training samples to classify the 50 test samples.
> > 2. Use the 50 training samples to do RMA, so that each cel file is
> > converted to a gene expression vector. Suppose the mapping from cel
> > file to expression vector is:
> > Expression = f(cel). The form of f is determined by the 50 training
> > cel files. Then apply the same mapping to the test cel files.
> >
> > I would think method 2 is more reasonable and truly blind. However,
> > it is not clear how to determine the function f from the 50 training
> > cel files. Method 1 is easy to implement, but it is not truly blind,
> > since the normalization of the training cel files actually uses
> > information from the test cel files.
> > Could anybody tell me how to determine the function f from the 50
> > training cel files?
> >
> > Many thanks, James

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111
