[BioC] classification issues - normalization and standardization

Steve Lianoglou mailinglist.honeypot at gmail.com
Mon Jul 18 16:52:39 CEST 2011

Hi Theresa,

On Mon, Jul 18, 2011 at 7:16 AM, Theresa Brandt
<theresabrandt80 at gmail.com> wrote:
> Hello,
>  I use microarrays to create and test a classifier and I have a question
> related to this topic. Theoretically one cannot use a test set in creating a
> classifier. It is obvious when thinking about selection of differentially
> expressed genes and about training.

I'm a bit confused here. If by "cannot use a test set in
creating a classifier" you mean that you cannot use your test set
during the training step of your model, then that's correct to some
extent.

People actively move their samples between the training and testing
sets when doing things like cross-validation (unless you have a
completely separate validation set).

But I digress ..

> But what about such steps like
> normalization, non-specific gene selection (for example selection of genes
> with high variance) and standardization? Can I perform these steps on the
> whole dataset? Or should I do it only using the training set? I saw that
> people often don't care and use the whole dataset to perform these steps but
> I'm not sure if this is really correct.

I wouldn't do much more to all of your data at once other than things
like array/RMA normalization.

I think it might get a bit questionable when you are "feature mining"
across all of your data, although there are scenarios like
"transductive learning" that do something like that.

I would remove low-variance genes only after you split your data into
training/test, calculating each gene's variance on the training portion
alone. If you were in, say, a 10-fold cross-validation scenario, then
I'd be doing the "variance axe" 10 times, once per fold.
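
To make the per-fold filtering concrete, here is a minimal sketch in plain
Python. Everything in it is an assumption made for illustration: the toy
expression matrix, the helper names (top_variance_genes, kfold_indices) and
the choice to keep the 3 most variable genes. The only point is that the
variance is computed from the training samples of each fold and never from
the held-out samples.

```python
import random
import statistics


def top_variance_genes(expr, sample_idx, n_keep):
    """Return the indices of the n_keep genes with the highest variance,
    computed only over the samples listed in sample_idx."""
    variances = []
    for gene, row in enumerate(expr):
        values = [row[s] for s in sample_idx]
        variances.append((statistics.variance(values), gene))
    variances.sort(reverse=True)
    return sorted(gene for _, gene in variances[:n_keep])


def kfold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy data (made up): 6 genes (rows) x 20 samples (columns).
rng = random.Random(1)
n_genes, n_samples = 6, 20
scales = [0.1, 2.0, 0.2, 3.0, 0.1, 1.5]  # hypothetical per-gene spread
expr = [[rng.gauss(8.0, scales[g]) for _ in range(n_samples)]
        for g in range(n_genes)]

folds = kfold_indices(n_samples, k=10)
selected_per_fold = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [s for s in range(n_samples) if s not in held_out]
    # The "variance axe" sees the training samples of this fold only.
    keep = top_variance_genes(expr, train_idx, n_keep=3)
    selected_per_fold.append(keep)
    # A classifier would now be trained on the rows in `keep` restricted to
    # train_idx, and evaluated on the same rows restricted to test_idx.
```

The same pattern should apply to standardization: estimate each gene's mean
and standard deviation on the training samples of the fold and reuse those
values to scale the held-out samples.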

If you are concerned about how to normalize data you've never seen
before so that you can apply your classifier to it at some later point
after training/model building, you might want to look at "frozen RMA",
which will allow you to normalize new/unseen data in some 'standard way'.
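
The general idea of "freezing" a normalization can be sketched in a few
lines of Python. To be clear, this is not the frozen RMA algorithm; it is a
simplified quantile normalization against a reference distribution, and the
function names and toy intensities are assumptions made for illustration.
What it shows is that the reference is estimated once from the training
arrays and then reused unchanged for an array that arrives later.

```python
def reference_distribution(train_arrays):
    """Average the sorted intensities across the training arrays,
    giving one reference value per rank."""
    sorted_arrays = [sorted(a) for a in train_arrays]
    n = len(sorted_arrays)
    return [sum(column) / n for column in zip(*sorted_arrays)]


def normalize_to_reference(array, reference):
    """Replace each value with the reference value of the same rank."""
    order = sorted(range(len(array)), key=lambda i: array[i])
    out = [0.0] * len(array)
    for rank, i in enumerate(order):
        out[i] = reference[rank]
    return out


# Toy training arrays (made up): 3 arrays x 4 probes.
train = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 4.5, 2.0],
         [3.0, 4.0, 6.0, 8.0]]

# Estimated once from the training data, then kept fixed.
frozen_ref = reference_distribution(train)

# A new array is mapped onto the frozen reference; the training arrays
# are not re-normalized and the reference is not updated.
new_array = [10.0, 7.0, 9.0, 8.0]
normalized = normalize_to_reference(new_array, frozen_ref)
```

Because the reference never changes, adding a new sample cannot alter the
values your classifier was trained on.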

Perhaps others can provide better insight. Hope that helps,
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
