[R] simplifying randomForest(s)
rdiaz at cnio.es
Tue Sep 16 11:44:21 CEST 2003
I have been using the randomForest package for a couple of difficult
prediction problems (which also share p >> n). The performance is good, but
since all the variables in the data set are used, interpretation of what is
going on is not easy, even after looking at variable importance as produced
by the randomForest run.
I have tried a simple "variable selection" scheme, and it does seem to perform
well (as judged by leave-one-out) but I am not sure if it makes any sense.
The idea is, in a kind of backwards elimination, to eliminate one by one the
variables with the smallest importance (or all the ones with negative importance
in one go) until the out-of-bag estimate of classification error becomes
larger than that of the previous model (or of the initial model). So nothing
really new. But I haven't been able to find any comments in the literature
about "simplification" of random forests.
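
The elimination loop described above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not a tested recipe:
the names fit_forest and simplify_forest and the arguments drop_negative and
min_vars are made up for the example, x is assumed to be a matrix or data
frame with column names, y a factor, and importance is taken to be the mean
decrease in accuracy as returned by importance(rf, type = 1).

```r
# Hypothetical helper (not part of randomForest): fit a classification forest
# and return its final OOB error together with a named importance vector.
fit_forest <- function(x, y, ntree = 500) {
  rf  <- randomForest::randomForest(x = x, y = y, ntree = ntree,
                                    importance = TRUE)
  imp <- randomForest::importance(rf, type = 1)  # mean decrease in accuracy
  list(oob = unname(rf$err.rate[rf$ntree, "OOB"]),
       importance = setNames(imp[, 1], rownames(imp)))
}

# Hypothetical backward elimination: drop the least important variable (or,
# if drop_negative is TRUE, all variables with negative importance at once)
# and stop as soon as the OOB error exceeds that of the previous model.
simplify_forest <- function(x, y, fit = fit_forest,
                            drop_negative = FALSE, min_vars = 1) {
  vars    <- colnames(x)
  current <- fit(x[, vars, drop = FALSE], y)
  while (length(vars) > min_vars) {
    imp  <- current$importance
    neg  <- names(imp)[imp < 0]
    drop <- if (drop_negative && length(neg) > 0) neg else names(which.min(imp))
    candidate_vars <- setdiff(vars, drop)
    if (length(candidate_vars) < min_vars) break
    candidate <- fit(x[, candidate_vars, drop = FALSE], y)
    if (candidate$oob > current$oob) break  # error got worse: keep previous model
    vars    <- candidate_vars
    current <- candidate
  }
  list(vars = vars, oob = current$oob)
}
```

Something like simplify_forest(x, y)$vars would then give the retained
variables. Comparing against the initial model instead of the previous one
would only require storing the first OOB error and using it in the break
condition. Since the OOB error changes from run to run, the result depends
on the random seed and on ntree.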
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)