[R] simplifying randomForest(s)

Tue Sep 16 16:31:19 CEST 2003

Dear Andy,

Thanks a lot for your message.

> This is quite a hazardous game.  We've been burned by this ourselves.  I'll
> send you a paper we submitted on variable selection for random forest
> off-line.  (Those who are interested, let me know.)

Thanks!

>
> The basic problem is that when you select important variables by RF and
> then re-run RF with those variables, the OOB error rate becomes biased
> downward. As you iterate more times, the "overfitting" becomes more and
> more severe (in the sense that, the OOB error rate will keep decreasing
> while the error rate on an independent test set will be flat or increase).  I
> was naïve enough to ask Breiman about this, and his reply was something
> like "any competent statistician would know that you need something like
> cross-validation to do that"...

Yes, I understand the points you are making. However, I have tried to protect
against this problem by estimating the leave-one-out cross-validation error
(LOOCVE) of the complete selection process, and the LOOCVE suggests this is
working. Within the variable selection routine the OOB error rate is biased,
but that does not concern me much, because I only use it to guide the
selection; my final estimate of error comes from the LOOCVE.
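
As a rough illustration of the downward bias discussed above, the following sketch selects variables on pure-noise data and then re-runs the forest on the selected subset. Everything here is an assumption made for illustration (simulated data, 200 noise predictors, keeping the 10 top-ranked variables, importance measured as mean decrease in accuracy); it is not the selection routine used in this thread. It assumes the randomForest package is installed.

```r
library(randomForest)

set.seed(1)
n <- 29; p <- 200
x <- data.frame(matrix(rnorm(n * p), nrow = n))  # pure noise predictors
y <- factor(rep(c("a", "b"), length.out = n))    # labels unrelated to x

# Step 1: rank variables with a forest fitted on ALL observations
full.rf <- randomForest(x = x, y = y, ntree = 500, importance = TRUE)
imp <- importance(full.rf, type = 1)[, 1]        # mean decrease in accuracy
top <- names(sort(imp, decreasing = TRUE))[1:10]

# Step 2: re-run the forest using only the selected variables
reduced.rf <- randomForest(x = x[, top], y = y, ntree = 500)

# OOB error after the last tree, before and after selection
oob.full    <- full.rf$err.rate[full.rf$ntree, "OOB"]
oob.reduced <- reduced.rf$err.rate[reduced.rf$ntree, "OOB"]
print(c(full = oob.full, reduced = oob.reduced))
```

Since the labels carry no information, the true error of any classifier here is about 0.5. The OOB error of the reduced forest would be expected to fall below that, because the same observations were used both to pick the variables and to compute the OOB estimate.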

This is the skeleton of the algorithm:

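n <- length(y)
prediction <- character(n)  # preallocate the predicted class labels

for (i in seq_len(n)) {
	# Run the complete variable-selection procedure without observation i
	the.simple.rf <- simplify.the.rf(data = data[-i, ])
	# Predict the held-out observation; store the label, not the factor code
	prediction[i] <- as.character(predict(the.simple.rf, newdata = data[i, , drop = FALSE]))
}
# Proportion of held-out observations that were misclassified
loocve <- mean(as.character(y) != prediction)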

Thus, the LOOCVE is computed with observations that were never used for the
simplification of the forest that is predicting them.
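
A self-contained version of this idea might look like the sketch below. The function simplify.rf.sketch() is a hypothetical stand-in, not the simplify.the.rf() routine referred to above: it simply keeps the n.keep variables with the largest mean decrease in accuracy and refits. The names loocv.error, n.keep, and the simulated data are likewise assumptions for illustration. The randomForest package is assumed to be installed.

```r
library(randomForest)

# Hypothetical stand-in for the selection routine (NOT the one in this thread):
# fit a forest, keep the n.keep most important variables, refit on those.
simplify.rf.sketch <- function(x, y, n.keep = 5, ntree = 500) {
  full.rf <- randomForest(x = x, y = y, ntree = ntree, importance = TRUE)
  imp <- importance(full.rf, type = 1)[, 1]
  keep <- names(sort(imp, decreasing = TRUE))[seq_len(min(n.keep, length(imp)))]
  list(vars = keep,
       rf = randomForest(x = x[, keep, drop = FALSE], y = y, ntree = ntree))
}

# Leave-one-out CV of the WHOLE process: selection is redone for every fold,
# so observation i never influences the forest that predicts it.
loocv.error <- function(x, y, ...) {
  n <- length(y)
  prediction <- character(n)
  for (i in seq_len(n)) {
    fit <- simplify.rf.sketch(x[-i, , drop = FALSE], y[-i], ...)
    prediction[i] <- as.character(
      predict(fit$rf, newdata = x[i, fit$vars, drop = FALSE]))
  }
  mean(as.character(y) != prediction)
}

# Simulated example: 29 observations, 20 predictors, signal only in X1
set.seed(2)
x <- data.frame(matrix(rnorm(29 * 20), nrow = 29))
y <- factor(rep(c("a", "b"), length.out = 29))
x$X1 <- x$X1 + ifelse(y == "b", 2, 0)
loocve <- loocv.error(x, y, n.keep = 3, ntree = 100)
print(loocve)
```

The point of the design is that the selection step sits inside the loop. Selecting variables once on all the data and cross-validating only the final fit would reintroduce the bias.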

[I'll be glad to send my code to anyone interested].

And the interesting thing with the data set I have tried is that it seems to
perform reasonably well (actually, the LOOCVE of the forest with the reduced
set of variables is smaller than the LOOCVE of the original forest).

(This is a first shot. I have a small sample size (29), so LOOCV is not that
bad in terms of computation, although I am aware it can have high variance. I
guess I could try the .632+ bootstrap method.)
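
For reference, the final combination step of the .632+ estimator (Efron and Tibshirani, 1997) can be sketched in base R as below. The function names err632plus and no.info.rate are made up for illustration. The sketch assumes the three inputs have already been computed, and that, as with the LOOCV above, the whole selection process is repeated inside every bootstrap sample.

```r
# .632+ combination, given:
#   err.app  : apparent (resubstitution) error of the full procedure
#   err.boot : leave-one-out bootstrap error (each case is predicted only by
#              fits whose bootstrap sample did not contain it)
#   gamma    : no-information error rate
err632plus <- function(err.app, err.boot, gamma) {
  err.boot1 <- min(err.boot, gamma)          # cap at the no-information rate
  # Relative overfitting rate, set to 0 when there is no sign of overfitting
  R <- if (err.boot1 > err.app && gamma > err.app) {
    (err.boot1 - err.app) / (gamma - err.app)
  } else {
    0
  }
  w <- 0.632 / (1 - 0.368 * R)               # weight ranges from 0.632 to 1
  (1 - w) * err.app + w * err.boot1
}

# No-information rate for classification: the expected error if predictions
# were independent of the true classes, sum over classes of p_l * (1 - q_l).
no.info.rate <- function(y, predicted) {
  lev <- union(as.character(y), as.character(predicted))
  p <- table(factor(y, levels = lev)) / length(y)
  q <- table(factor(predicted, levels = lev)) / length(predicted)
  sum(p * (1 - q))
}
```

With no overfitting (R = 0) this reduces to the ordinary .632 estimate, 0.368 * err.app + 0.632 * err.boot. With complete overfitting (R = 1) all the weight goes to the bootstrap error.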

Best,

Ramón

>
> Best,
> Andy
>
> > Any suggestions/comments?
> >
> > Best,
> >
> > Ramón

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz