[BioC] RMA question

Mon Dec 18 11:54:09 CET 2006

hi james,

briefly, to make new chips comparable to a training data set normalized
with RMA you can do the following:

normalize your training arrays keeping track of:

(1) the means over the ranks used in quantile normalization
(2) the probe effects estimated by the median polish procedure
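a minimal sketch of what "keeping track of" (1) and (2) could look like, in Python with NumPy. this is not the code offered further down in this mail; the function names and the matrix layout (probes in rows, arrays in columns, log2 scale, already background-corrected) are assumptions made for illustration, and real RMA implementations differ in details such as tie handling:

```python
import numpy as np

def reference_quantiles(x):
    """(1) Mean over arrays at each rank; x has shape (n_probes, n_arrays)."""
    return np.sort(np.asarray(x, dtype=float), axis=0).mean(axis=1)

def median_polish_probe_effects(y, n_iter=10, tol=1e-6):
    """(2) Fit y[i, j] ~ overall + probe[i] + chip[j] by Tukey's median polish.

    y holds the normalized log2 intensities of ONE probe set,
    shape (n_probes_in_set, n_arrays).
    Returns (overall, probe_effects, chip_effects).
    """
    resid = np.array(y, dtype=float)
    overall = 0.0
    probe = np.zeros(resid.shape[0])
    chip = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # sweep out row (probe) medians
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe += row_med
        delta = np.median(chip)
        chip -= delta
        overall += delta
        # sweep out column (chip) medians
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        chip += col_med
        delta = np.median(probe)
        probe -= delta
        overall += delta
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return overall, probe, chip
```

the vector from reference_quantiles() and the per-probe-set probe effects are the only things that need to be stored alongside the classifier.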

as the background correction is performed chip-by-chip, you can
transform each test (future) array to be compatible with the training
arrays (and the classifier) using the above information. f() then works
roughly like this:

 * replace the (ranked) test-expression values with the means over the
ranks from (1) (you're normalized now)

 * calculate a chip effect (for each probe set) by subtracting the
probe effects from (2) from the probes in that probe set (you're done now)
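the two steps above can be sketched as follows, in Python with NumPy. the function names are hypothetical, and two points are assumptions rather than part of the description above: ties are broken by order of appearance, and the chip effect is taken as the median of the probe-effect-adjusted values. the test array must come from the same chip type, so that it has as many probes as the stored rank means:

```python
import numpy as np

def normalize_to_reference(values, ref_quantiles):
    """Replace each value by the stored training mean at its rank.

    values        : background-corrected log2 intensities of one test array
    ref_quantiles : means over the ranks saved from the training arrays
    """
    values = np.asarray(values, dtype=float)
    ref = np.sort(np.asarray(ref_quantiles, dtype=float))
    if values.shape != ref.shape:
        raise ValueError("test array and reference must have the same length")
    # order[k] is the index of the k-th smallest value (ties: first come first)
    order = np.argsort(values, kind="stable")
    out = np.empty_like(values)
    out[order] = ref
    return out

def chip_effect(normalized_probe_set, probe_effects):
    """Expression value of one probe set on one test array.

    Subtracts the stored probe effects and summarizes by the median
    (the median is an assumption of this sketch).
    """
    adjusted = np.asarray(normalized_probe_set, dtype=float) - np.asarray(
        probe_effects, dtype=float
    )
    return float(np.median(adjusted))
```

if the stored probe effects are centered (median zero within each probe set), the value returned by chip_effect() is on the same scale as the training expression values, so it can be passed to the classifier directly.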

i can send you the code for the above, in case you are interested.

all the best,

	dennis

Naomi Altman wrote:
> I would say that it depends on how you plan to use the classification function.
> 
> If, in future, you will collect more samples, and use the 
> classification function to classify them, then you need to normalize 
> the test set the same way you will normalize the new arrays.
> How you plan to do this may also affect how you normalize the training set.
> 
> --Naomi
> 
> At 02:53 PM 12/17/2006, Wolfgang Huber wrote:
>> Hi James,
>>
>> this is a general problem of normalization methods that work by adapting
>> arrays in a set to themselves, and not to an independent reference.
>>
>> Option 1 is indeed discredited when you want to get a fair estimate of
>> classification rates, since it does not faithfully simulate the real
>> application where you want to classify a new sample.
>>
>> Option 2 does not work since f contains for each array a number of
>> array-specific, idiosyncratic parameters that reflect hybridization
>> conditions, labeling efficiency, RNA extraction etc. You cannot "learn"
>> them in advance.
>>
>> The option I'd take is to look for a normalization method that
>> normalizes each new array individually (or in sets appropriate to your
>> intended application) to an existing database of reference arrays. I
>> know that various people on this list have been/are working on such
>> methods. But I am probably not up-to-date myself - maybe someone can
>> recommend?
>>
>>   Best wishes
>>   Wolfgang
>>
>> ------------------------------------------------------------------
>> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>>
>>
>>> Hi, I have a question about RMA normalization. Since RMA is an
>>> across-sample normalization, suppose I have 50 training samples (cel
>>> files) and 50 test samples (cel files). There are two ways to perform
>>> normalization:
>>> 1. Combine all the 100 samples together and use RMA to do
>>> normalization. Then train on the training set of 50 samples to classify
>>> the 50 test samples.
>>> 2. Use the 50 training samples to do RMA, then each cel file is
>>> converted to a gene expression vector. Suppose the mapping from cel file
>>> to expression vector is:
>>> Expression = f(cel). The form of f is determined by the 50 training
>>> cel files. Then apply the same mapping to the test cel files.
>>> I would think method 2 is more reasonable and truly blind. However,
>>> it is not clear how to determine the function f from the 50 training cel
>>> files. Method 1 is easy to implement, but it is not truly blind, since
>>> the normalization of cel files from training samples actually utilized
>>> the information from test cel files.
>>> Could anybody tell me how to determine the function f from the 50
>>> training cel files?
>>> Many thanks, James
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> Naomi S. Altman                                814-865-3791 (voice)
> Associate Professor
> Dept. of Statistics                              814-863-7114 (fax)
> Penn State University                         814-865-1348 (Statistics)
> University Park, PA 16802-2111
> 