[BioC] Best options for cross validation machine learning

Wed Jan 20 17:29:13 CET 2010

Hi,

On Tue, Jan 19, 2010 at 12:09 PM, Sean Davis <seandavi at gmail.com> wrote:
> On Tue, Jan 19, 2010 at 11:11 AM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
>> Hello,
>
> Hi, Dan.
>
>> I have a microarray dataset which I have performed an unsupervised
>> Bayesian clustering algorithm on which divides the samples into four
>> groups.  What I would like to do is:
>> 1) Pick a group of genes that best predict which group a sample belongs to.
>
> Feature selection....
>
>> 2) Determine how stable these prediction sets are through some sort of
>> cross-validation (I would prefer not to divide my set into a training
>> and test set for stage one)
>
> Cross-validation....
>
> Note that for cross-validation, steps 1 and 2 necessarily need to be
> done together.
>
>> These steps fall into the supervised machine learning realm which I am
>> not familiar with and googling around the options seem endless.  I was
>> wondering whether anyone could suggest reasonable well-established
>> algorithms to use for both steps.
>
> Check out the MLInterfaces package.  There are MANY methods that could
> be applied.  It really isn't possible to boil this down to an email
> answer, unfortunately.

While this is absolutely true, one could always offer a simple suggestion :-)

A very easy thing for you (Daniel) to do would be to use the
glmnet package and fit logistic regressions to build several (four)
one-against-all classifiers. The nice thing about glmnet
is that it uses the lasso (or elastic net) regularizer to help cope with
your (likely) "p >> n" problem, and returns a sparse model: one with few
nonzero coefficients that can best predict in the scenario you've given it.
So, by giving it an "appropriate scenario" you essentially get the
ever-coveted-and-quite-controversial "gene signature" for your
group/phenotype of interest.
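
To make the one-against-all setup concrete, here is a minimal sketch in
Python (not R, and not glmnet itself) of how the four binary label
vectors could be built from the cluster assignments. The group names and
the sample assignments are made up for illustration. Each resulting 0/1
vector is the kind of response you would hand to a penalized logistic
regression, e.g. glmnet with family = "binomial".

```python
def one_vs_rest_labels(groups):
    """Map each distinct group to a 0/1 vector: 1 if the sample is in that group."""
    classes = sorted(set(groups))
    return {c: [1 if g == c else 0 for g in groups] for c in classes}


# Hypothetical cluster assignments for eight samples (not from the original post).
groups = ["A", "B", "C", "D", "A", "C", "B", "D"]

# One binary classification problem per group: "this group" vs. "everything else".
targets = one_vs_rest_labels(groups)
for name, y in targets.items():
    print(name, y)
```

Each of the four vectors defines a separate classifier, so you end up
with one candidate signature per group rather than one for all four.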

You'll of course have to do cross-validation, etc., which, as Sean and
Kasper have pointed out, is essential and (by definition) requires you
to split your data into (several) training/test sets.
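
As a rough illustration of why the gene selection has to happen inside
the cross-validation loop, here is a small Python sketch for a single
one-against-all problem. It is not what glmnet does: as a stand-in for
the lasso it ranks genes by the difference in class means and classifies
with a nearest-centroid rule. The toy data, the function names, and the
choice of 5 folds and 2 genes are all assumptions made for the example.
The point is only that genes are re-selected from the training samples
of each fold, and that counting how often each gene is picked gives a
crude measure of how stable the signature is.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_genes(X, y, train, n_genes):
    """Rank genes by absolute difference in class means, on training samples only."""
    pos = [i for i in train if y[i] == 1]
    neg = [i for i in train if y[i] == 0]
    scores = []
    for g in range(len(X[0])):
        mean_pos = sum(X[i][g] for i in pos) / len(pos)
        mean_neg = sum(X[i][g] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]


def centroid(X, rows, genes):
    """Mean expression of the selected genes over the given samples."""
    return [sum(X[i][g] for i in rows) / len(rows) for g in genes]


def predict(x, genes, c_pos, c_neg):
    """Assign the sample to the class whose centroid is closer."""
    d_pos = sum((x[g] - c) ** 2 for g, c in zip(genes, c_pos))
    d_neg = sum((x[g] - c) ** 2 for g, c in zip(genes, c_neg))
    return 1 if d_pos < d_neg else 0


def cross_validate(X, y, k=5, n_genes=2, seed=0):
    """Return held-out accuracy and a count of how often each gene was selected."""
    correct = 0
    picked = Counter()
    for test in k_fold_indices(len(X), k, seed):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        # Selection sees the training samples only, never the held-out fold.
        genes = select_genes(X, y, train, n_genes)
        picked.update(genes)
        c_pos = centroid(X, [i for i in train if y[i] == 1], genes)
        c_neg = centroid(X, [i for i in train if y[i] == 0], genes)
        correct += sum(predict(X[i], genes, c_pos, c_neg) == y[i] for i in test)
    return correct / len(X), picked


# Toy data (made up): 20 samples, 4 genes. Genes 0 and 1 track the class label,
# genes 2 and 3 do not.
y = [i % 2 for i in range(20)]
X = [[3 * y[i] + 0.1 * (i % 3),
      3 * y[i] - 0.1 * (i % 4),
      0.2 * (i % 5),
      0.2 * ((i * 7) % 5)] for i in range(20)]

accuracy, picked = cross_validate(X, y)
print("held-out accuracy:", accuracy)
print("times each gene was selected:", dict(picked))
```

If the selection step were run once on all samples before splitting, the
held-out samples would already have influenced which genes were chosen,
and the accuracy estimate would be optimistic.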

I agree with Kasper's final sentiment as well ... but while you most
likely won't get a patent for some diagnostic indicator (of whatever),
it doesn't mean that the genes in your "signature" won't be
informative for further downstream analysis, e.g. to help direct
further bench experiments (after more analysis, of course).

Lastly, if you extract your expression data into a matrix and are
comfortable working with it that way, you can also look at the
caret package on CRAN for functionality similar to MLInterfaces that
helps you set up your data for cross-validation, etc. In fact, there is
a nice paper written by the author of the caret package that shows you
how to use caret, which might not hurt to read anyway if this type of
stuff is new to you:

http://www.jstatsoft.org/v28/i05

Hope that helps,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact