[BioC] Machine learning, cross validation and gene selection

Daniel Brewer daniel.brewer at icr.ac.uk
Wed Sep 1 18:05:34 CEST 2010


Many thanks for the detailed reply.  That is very informative.  What I
mean by optimal is the collection of genes that any further studies
should use.  For example, say I have a cancer/normal dataset and I want
to find the top 10 genes that will classify the tumour type according to
an SVM.  I would like to know the set of genes plus SVM parameters that
could be carried into further experiments to assess whether the
signature works as a diagnostic test.

Thanks again

Dan

On 01/09/2010 4:48 PM, Vincent Carey wrote:
> Traditionally the purpose of cross-validation is to reduce bias in
> model appraisal.  The "resubstitution estimate" of classification
> accuracy uses the training data to appraise the model derived from the
> training data, and is typically biased; this is the subject of a
> substantial literature.  Cross-validation introduces a series of
> partitions into training and test sets, so that a collection of
> appraisals that are independent of the training data are obtained, and
> these are summarized.  When the training process involves feature
> selection, this should be part of each cross-validation step.  Clearly
> this process leads to a collection of chosen features likely
> possessing different elements for each step. There is no '"final"
> optimal' classifier implied by the procedure, but surveying the
> features chosen at each step may provide insight into commonly
> selected or informative features.  Random forests has a variable
> importance measure derived from a bootstrapping approach similar in
> some respects to cross validation; and a varSelRF package or function
> was discussed in recent list entries.  The MLInterfaces package, and
> probably many others such as CMA, provide tools to control and
> interpret cross-validation with embedded feature selection.  Be
> careful what you wish for -- what exactly do you mean by 'optimal
> classifier'?
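
To make the "surveying the features chosen at each step" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are synthetic, top_genes and selection_frequency are hypothetical helper names, and the score (absolute difference in class means over the summed standard deviations) is only a stand-in for whatever filter one would really use. It is not the MLInterfaces, CMA or varSelRF approach.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def top_genes(X, y, k):
    """Return indices of the k genes with the largest standardized mean difference."""
    scored = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        score = abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9)
        scored.append((score, g))
    return [g for _, g in sorted(scored, reverse=True)[:k]]


def selection_frequency(X, y, k):
    """Count how often each gene is selected across leave-one-out training sets."""
    counts = Counter()
    for i in range(len(X)):
        # Selection sees only the training part of split i.
        counts.update(top_genes(X[:i] + X[i + 1:], y[:i] + y[i + 1:], k))
    return counts


# Synthetic data (assumed): 20 samples, 30 genes, genes 0-2 shifted in class 1.
rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[rng.gauss(3.0 if (lab == 1 and g < 3) else 0.0, 1.0) for g in range(30)]
     for lab in y]

counts = selection_frequency(X, y, k=5)
for gene, n in counts.most_common():
    print(f"gene {gene}: selected in {n} of {len(X)} splits")
```

Genes that appear in nearly every split are the ones the selection step picks consistently; genes that appear only occasionally are more likely to be artefacts of a particular split.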
> 
> On Wed, Sep 1, 2010 at 10:55 AM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
>> Hello,
>>
>> I am getting a bit confused about gene selection and machine learning
>> and I was wondering if you could help me out.  I have a dataset that is
>> classified into two groups and my aim is to get a small number of genes
>> (10-20) in a gene signature that I will in theory be able to apply to
>> other datasets to optimally classify the samples.  As I do not have a test
>> and training set I am using Leave-one-out cross-validation to help
>> determine the robustness.  I have read that one should perform gene
>> selection for each split of the samples i.e.
>>
>> 1) Select one group as the test set
>> 2) On the remainder select genes
>> 3) Apply machine learning algorithm
>> 4) Test whether the test set is correctly classified
>> 5) Go to step 1, choosing a different test set each time
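
The loop above can be sketched roughly as follows, in plain Python. This is an illustration under stated assumptions rather than a recommended pipeline: the data are synthetic, select_genes, train_centroids, classify and loocv are hypothetical names, and a nearest-centroid rule is swapped in for the SVM so the example has no dependencies. The point is only that gene selection happens inside the loop, on the training samples alone.

```python
import random
from statistics import mean, pstdev


def select_genes(X, y, k):
    """Rank genes by standardized difference in class means; keep the top k."""
    scored = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scored.append((abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9), g))
    return [g for _, g in sorted(scored, reverse=True)[:k]]


def train_centroids(X, y, genes):
    """Stand-in learner (not an SVM): one mean profile per class."""
    return {lab: [mean(row[g] for row, l in zip(X, y) if l == lab) for g in genes]
            for lab in set(y)}


def classify(centroids, genes, sample):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(lab):
        return sum((sample[g] - c) ** 2 for g, c in zip(genes, centroids[lab]))
    return min(centroids, key=sq_dist)


def loocv(X, y, k):
    predictions = []
    for i in range(len(X)):                                # 1) hold out sample i
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, k)          # 2) select on the rest
        model = train_centroids(X_train, y_train, genes)   # 3) fit the learner
        predictions.append(classify(model, genes, X[i]))   # 4) test held-out sample
    return predictions                                     # 5) loop covers every sample


# Synthetic data (assumed): 16 samples, 40 genes, genes 0-3 shifted in class 1.
rng = random.Random(0)
y = [0] * 8 + [1] * 8
X = [[rng.gauss(2.0 if (lab == 1 and g < 4) else 0.0, 1.0) for g in range(40)]
     for lab in y]

predictions = loocv(X, y, k=10)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"LOOCV accuracy: {accuracy:.2f}")
```

The accuracy printed here appraises the whole procedure (selection plus learner), not any single gene list, since the selected genes can differ from split to split.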
>>
>> If you do this, you might get different genes each time, so how do you
>> get your "final" optimal gene classifier?
>>
>> Many thanks
>>
>> Dan
>>

-- 
**************************************************************

Daniel Brewer, Ph.D.

Institute of Cancer Research
Molecular Carcinogenesis
MUCRC
15 Cotswold Road
Sutton, Surrey SM2 5NG
United Kingdom

Tel: +44 (0) 20 8722 4109

**************************************************************

The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP.
