[BioC] feature selection

Tue Mar 25 08:36:35 MET 2003

First, some disclaimers:

1. I don't work with gene expression data, so lack the insights that others
have.
2. I maintain the randomForest package, and use it a lot, so count on me
being biased.

Now, if Karen's objective is finding differentially expressed genes, I agree
that randomForest is overkill.  However, for classification as well as
data exploration, randomForest can be a very handy tool.  What we have
found, through both simulated and real (non-genomic) data, is that the
variable importance measures can be very effective.  I don't see anything
wrong with using it to identify potentially "interesting" genes.
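
For readers who have not met the idea, here is a minimal sketch of one
common kind of importance measure, permutation importance: shuffle one
variable at a time and see how much the prediction error rises.  This is an
illustration only, not the randomForest package's code, and not necessarily
any of the numbered measures discussed below.  The classifier (nearest
centroid), the held-out test set, the synthetic data and all function names
are assumptions made for the example.

```python
import random

def fit_centroids(X, y):
    """Return {label: centroid} computed from the training rows."""
    centroids = {}
    for label in set(y):
        rows = [r for r, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, row):
    """Label of the closest centroid (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=dist)

def error_rate(centroids, X, y):
    return sum(predict(centroids, r) != lab for r, lab in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, rng, n_repeats=20):
    """Mean rise in error when each column is shuffled in turn."""
    base = error_rate(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        rise = 0.0
        for _ in range(n_repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)  # break the link between column j and the labels
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            rise += error_rate(centroids, X_perm, y) - base
        importances.append(rise / n_repeats)
    return importances

def make_data(n, rng):
    """Synthetic data (assumed): column 0 carries signal, columns 1-4 are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(3.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(100, rng)
X_test, y_test = make_data(100, rng)
model = fit_centroids(X_train, y_train)
importance = permutation_importance(model, X_test, y_test, rng)
print(importance)
```

Shuffling the signal column should raise the error far more than shuffling
any noise column, which is what makes the ranking useful for spotting
"interesting" variables.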

There are some points to keep in mind, though:

1.  We have found "measure 1" of variable importance to be uninformative in
some situations, and not very stable even with a large number of trees.  Leo
has decided to abandon measures 1 and 3.  In the next version of the
package, only measures 2 and 4 are computed.  Both of these are quite stable
(with, say, 500 or more trees).

2.  In most cases that we have seen, randomForest is extremely tolerant of
noise variables, in the sense that the cross-validated error rates do not
improve significantly as the number of variables is reduced, for data sets
where we know there are a large number of noise variables.  While reducing
the number of variables may be a necessity for other classifiers, it doesn't
affect RF much most of the time.

3.  Considering #2 above, the value of the importance measures is really
mostly for "interpretation" or exploration.  There's an obvious drawback,
though:  The measures do not give any hint about trend or direction.  To gain
further insight into the structure of the data, one should use the information
provided by variable importance and carry out further exploration with other
tools (e.g., fit more "interpretable" models using the most important
variables, but be careful not to read too much into the performance of such
models, as selection bias has crept in).
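
The selection-bias caveat in point 3 can be made concrete with a small
sketch.  Everything in it is an assumption chosen for illustration: the data
are pure noise (50 samples, 900 variables, 2 classes), the filter is a crude
difference in class means rather than an importance measure or F statistic,
and the classifier is nearest centroid rather than a forest.  The point is
only that choosing variables on all the data and then cross-validating looks
much better than it should, whereas repeating the selection inside each fold
does not.

```python
import random

def class_means(X, y, cols):
    """Per-class mean of each requested column: {label: [mean, ...]}."""
    means = {}
    for label in set(y):
        rows = [r for r, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    return means

def top_k_features(X, y, k):
    """Indices of the k columns with the largest gap between class means."""
    all_cols = range(len(X[0]))
    m = class_means(X, y, all_cols)
    gaps = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_cols, key=lambda j: gaps[j], reverse=True)[:k]

def classify(means, row, cols):
    """Nearest centroid, using only the selected columns."""
    def dist(label):
        return sum((row[j] - c) ** 2 for j, c in zip(cols, means[label]))
    return min(means, key=dist)

def cv_error(X, y, k, select_inside, n_folds=5):
    """Cross-validated error; selection is redone per fold only if asked."""
    global_feats = top_k_features(X, y, k)  # has seen every row, test rows included
    wrong = 0
    for f in range(n_folds):
        train = [i for i in range(len(y)) if i % n_folds != f]
        test = [i for i in range(len(y)) if i % n_folds == f]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        feats = top_k_features(X_tr, y_tr, k) if select_inside else global_feats
        means = class_means(X_tr, y_tr, feats)
        wrong += sum(classify(means, X[i], feats) != y[i] for i in test)
    return wrong / len(y)

rng = random.Random(1)
y = [i % 2 for i in range(50)]                                  # labels unrelated to X
X = [[rng.gauss(0.0, 1.0) for _ in range(900)] for _ in range(50)]

biased = cv_error(X, y, k=10, select_inside=False)   # selection done before CV
honest = cv_error(X, y, k=10, select_inside=True)    # selection redone in each fold
print("selection outside CV:", biased)
print("selection inside CV: ", honest)
```

Since there is no real signal, an honest estimate should sit near chance;
the error from selecting outside the cross-validation loop comes out lower
only because the held-out samples helped pick the variables.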

That's my $0.02 for the day...

Andy

> -----Original Message-----
> From: Nicholas Lewin-Koh [mailto:nikko at hailmail.net]
> Sent: Monday, March 24, 2003 10:52 PM
> To: Karen.Chancellor at asu.edu
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re:[BioC] feature selection
> 
> 
> Hi Karen,
> I don't know that starting with randomForest and using the importance
> values is the best way to start. I would suggest first filtering the
> data in different ways, such as keeping the genes with the 200 largest F
> values. If your question is to identify differentially expressed genes,
> then you really want a multiple comparisons approach. The multcomp
> package is quite good. If the interest is a classification rule, try
> filtering in different ways, as suggested above, and then try some
> exploratory discriminant analysis.
> I have gotten good results with the fda function in the mda package on
> CRAN. Use the gen.ridge method option and that gives penalized
> discriminant analysis. This can help to look at the projections and just
> determine if the states are separable. You can also look at the
> coefficients for each variable. After some careful EDA, then go for the
> classification.
> 
> Nicholas  
> 
> 
> Karen writes>
> Hello Bioconductor folk,
> Can any of the bioconductor packages be used on a .pcl file, rather than
> starting with the raw data?
> I am starting with a .pcl file containing approximately 900 genes and 50
> samples, which I have read using read.table. The classification is
> known, and there are 3 classes of samples. I am interested in reducing
> the number of genes. I would like to use the R randomForest package for
> this task. Is this appropriate? I'm new to this so will appreciate any help.
> 
> Thanks
> Karen