[BioC] Machine learning, cross validation and gene selection

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Sep 1 19:06:54 CEST 2010


On Wed, Sep 1, 2010 at 12:05 PM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
> Many thanks for the detailed reply.  That is very informative.  What I
> mean by optimal is the collection of genes that any further studies
> should use.  For example, say I have a cancer/normal dataset and I want
> to find the top 10 genes that will classify the tumour type according to
> an SVM.  I would like to know the set of genes plus SVM parameters that
> could be used in further experiments to see if it could be used as a
> diagnostic test.

Here is another view on this:

The *real* purpose of your leave-one-out (or whatever) cross
validation is to assess how well your model can generalize to unseen
data.

During this CV phase for your SVM, for instance, you would take the
opportunity to determine the optimal values for your parameters (maybe
the cost param, or nu, or whatever) -- for example, you could average
the value of the best parameter found during each fold (the one that
gives the best classification accuracy) as your "final" parameter(s).
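
Just to make the fold-wise tuning concrete, here is a minimal sketch
-- in Python/scikit-learn rather than R, purely for illustration, with
a made-up 100 x 500 matrix standing in for your expression data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Toy stand-in for an expression matrix: 100 samples x 500 "genes"
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

grid = [0.01, 0.1, 1, 10, 100]  # candidate cost (C) values
best_per_fold = []
for train, test in StratifiedKFold(n_splits=5, shuffle=True,
                                   random_state=0).split(X, y):
    # Score each candidate C on this fold's held-out samples
    accs = [SVC(kernel="linear", C=C).fit(X[train], y[train])
            .score(X[test], y[test]) for C in grid]
    best_per_fold.append(grid[int(np.argmax(accs))])

# Average the per-fold winners as the "final" parameter, as suggested above
final_C = float(np.mean(best_per_fold))
```

(In practice people often grid-search on the training portion of each
fold via nested CV; the above just mirrors the averaging idea
described in the text.)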

Also, during the CV you will want to see how different each model is
-- not just how accurate your model is on the test set. Maybe you can
look at the concordance of the top features across folds: if they are
the same features, are they weighted equally, etc.?
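
One way to check that concordance (again a Python/scikit-learn sketch
on synthetic data, using the |weight| of a linear SVM to rank genes --
the top-10 cutoff and Jaccard overlap are just illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

top_sets = []
for train, _ in StratifiedKFold(n_splits=5, shuffle=True,
                                random_state=0).split(X, y):
    clf = SVC(kernel="linear", C=1.0).fit(X[train], y[train])
    w = np.abs(clf.coef_.ravel())
    top_sets.append(set(np.argsort(w)[-10:]))  # 10 largest |weights|

core = set.intersection(*top_sets)  # genes every fold agrees on
# Pairwise Jaccard overlap between the folds' top-10 lists
jaccards = [len(a & b) / len(a | b)
            for i, a in enumerate(top_sets) for b in top_sets[i + 1:]]
```

High overlap across folds suggests the signature is stable; wildly
different top lists are a warning sign.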

Once you have sufficiently convinced yourself that an SVM, with your
type of data and your fine-tuned parameter values, can "admirably"
generalize to unseen data, then you have reached the objective of the
cross validation phase.

You could then take *all* of your data and rebuild your model (w/ your
params) and use the model that falls out of this as the hammer you
will use to attack data that is *really* unseen.
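
The "rebuild on everything" step is trivially short -- sketched here
in Python/scikit-learn on synthetic data, with the tuned parameter
hard-coded for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

# final_C would come out of the CV/tuning phase; fixed here for the sketch
final_C = 1.0

# Refit on *all* of the data: this is the model you carry forward
final_model = SVC(kernel="linear", C=final_C).fit(X, y)
# ... and later call final_model.predict(...) on truly unseen samples
```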

Some random comments:

If you are going to use an SVM and are looking to "prune" the
features it selects, you might want to look into L1-penalized SVMs
(there's an R package for that).
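
The appeal of the L1 penalty is that it drives most feature weights
exactly to zero, so the model does the pruning for you. A sketch
(Python/scikit-learn stand-in, synthetic data; the C value is an
arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

# L1-penalized linear SVM: most gene weights end up exactly zero
# (the L1 penalty requires the primal formulation, hence dual=False)
clf = LinearSVC(penalty="l1", dual=False, C=0.1,
                max_iter=10000).fit(X, y)
selected = np.flatnonzero(clf.coef_.ravel())  # the surviving genes
```

Shrinking C makes the penalty stronger and the selected set smaller.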

Looking down this alley may also be fruitful:

Another way to do that using "normal" SVMs is to perform recursive
feature elimination ... these are all things you can google :-)
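
For reference, recursive feature elimination with a linear SVM --
repeatedly dropping the lowest-|weight| genes and refitting -- might
look like this (Python/scikit-learn sketch, synthetic data, top-10
cutoff chosen to match the question):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

# step=0.5: discard the lowest-ranked half of the remaining genes
# on each iteration, until only 10 survive
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10,
          step=0.5).fit(X, y)
top10 = rfe.support_.nonzero()[0]  # indices of the 10 kept genes
```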

I'm guessing those packages (and papers they lead to) will probably
give you some more information on how you might go about choosing your
"final" model in some principled manner ...

FWIW, I might pursue the "penalized" types of classifiers a bit more
aggressively if I were in the shoes it sounds like you are wearing
(I'm a big fan of the glmnet package -- which also does penalized
logistic regression) ... but you know what they say about taking
advice found on mailing lists ... ;-)
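
For what it's worth, the closest scikit-learn analogue of glmnet's
penalized logistic regression is an elastic-net fit -- sketched below
on synthetic data, with l1_ratio and C as arbitrary illustrative
values (glmnet itself, of course, lives in R):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

# Elastic-net logistic regression: l1_ratio mixes the L1 (sparsity)
# and L2 (shrinkage) penalties; the saga solver supports this combo
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.5,
                         max_iter=5000).fit(X, y)
n_selected = int(np.count_nonzero(clf.coef_))  # genes with nonzero weight
```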

Hope that was helpful,

Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
