[R] some question regarding random forest

Liaw, Andy andy_liaw at merck.com
Tue Mar 2 02:00:06 CET 2004


Rajarshi,

1. You need to be aware that the predicted values returned by randomForest
for the training data are _not_ really model predictions for the training
data.  They are the `out-of-bag' (OOB) predictions: each case is left out
of the bootstrap sample for roughly e^-1, or about 36%, of the trees, and
its predicted value is based on only those trees, so the case can be
predicted `honestly'.  The OOB prediction thus provides a convenient
`honest' estimate of prediction error, without resorting to
cross-validation or another layer of bootstrap.  The MSE and R^2 you get
from the OOB predictions will be similar to what you would get from an
independent test set or from cross-validation.

If you really want predictions on the training data using _all_ trees in
the forest, you need to use predict() and supply the training data as
newdata.  You will see the over-optimistic predictions that way.  (For
classification, this usually gives perfect prediction!)
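
A minimal sketch of the difference (assuming a hypothetical data frame
`dat' with numeric response `y'; adapt the names to your data):

  library(randomForest)
  set.seed(1)
  rf  <- randomForest(y ~ ., data = dat)
  oob <- predict(rf)                 ## no newdata: OOB predictions
  fit <- predict(rf, newdata = dat)  ## all trees: over-optimistic
  cor(oob, dat$y)^2                  ## `honest' R^2, like a test set
  cor(fit, dat$y)^2                  ## inflated R^2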

Increasing the number of trees beyond, say, a few hundred is unlikely to
improve performance.  I usually only do that to get more stable estimates
of variable importance.  RF is relatively resistant to parameter tuning;
I do not consider the number of trees a tuning parameter, as the theory
says the generalization error converges as the number of trees goes to
infinity.  Some people have found that changing nodesize can lead to
different performance.  You may want to check out the tune() function in
the package e1071, which can be used to tune randomForest.
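
A minimal sketch (again with the hypothetical `dat' and `y'; the grid
values here are arbitrary):

  library(e1071)
  library(randomForest)
  tuned <- tune(randomForest, y ~ ., data = dat,
                ranges = list(mtry = c(2, 4, 8),
                              nodesize = c(1, 5, 10)))
  summary(tuned)          ## cross-validated error over the grid
  tuned$best.parameters   ## the winning combination

By default tune() uses 10-fold cross-validation, so its error estimate
is independent of the OOB estimate discussed above.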

Prof. Breiman is working on a paper for Statistical Science.  Stay tuned for
that.

For #2, I don't understand what you mean by `the other pairs'.  There are
many ways to measure `variable importance', and they can have very
different interpretations.  The particular ones implemented in
randomForest are explained in the help page, as well as in the `manual'
Prof. Breiman provides on his web site (cited in the help page).
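
If you want the numbers behind the plot, something like the following
(depending on your version of the package, the plotting function may be
called var.imp.plot() or varImpPlot(); set importance=TRUE to get the
permutation-based measure):

  library(randomForest)
  rf <- randomForest(y ~ ., data = dat, importance = TRUE)
  importance(rf)   ## the importance measures, one row per variable
  varImpPlot(rf)   ## dot chart sorted by decreasing importance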

HTH,
Andy

> From: Rajarshi Guha
> 
> Hi,
>   I had two questions regarding random forests for regression.
> 
> 1) I have read the original paper by Breiman as well as a paper
> discussing an application of random forests, and it appears that one
> of the nice features of this technique is good predictive ability.
> 
> However, I have some data with which I have generated a linear model
> using lm().  I get an RMS error of 0.43 and an R^2 of 0.62.
> However, when I make a plot of predicted versus observed using the
> randomForest() function, the plot is much more scattered (RMS error of
> 0.55 and R^2 of 0.33) than for a similar plot using the linear model.
> (When a test set is supplied to the models the R^2 values are close.)
> 
> My question is: should I expect randomForest to give me similar or
> better results than a simple linear model?  In the above case I was
> expecting that for the training data (i.e., the data with which the
> random forest was built) I would get less scatter in the plot and a
> lower RMSE.  (I realize that too much stock shouldn't be placed in
> R^2.)
> 
> The papers note that overfitting is not a problem with random forests,
> so I was wondering what I could do to improve the results.  I've tried
> playing with the number of trees and the value of m_try, but I don't
> see much change.
> 
> Is there anything that I can do to improve the results for a random
> forest model?  (Are there any significant papers, apart from
> Breiman's, that I should be reading related to random forests?)
> 
> 2) My second question relates to the interpretation of the variable
> importance plot from var.imp.plot().  I realise that the variables are
> ordered by decreasing importance.  However, I see, for example, a
> large drop in the Importance value from the first (i.e., most
> important) variable to the second, whereas for other pairs the
> difference in the Importance value is not so large.
> 
> Is the difference between Importance values a measure of `how much
> more important' one variable is than another?  Or am I going in the
> wrong direction?
> 
> In addition, is there any sort of rule or heuristic that can be used
> to say, for example, that the first N variables account for the model?
> Or is the interpretation of variable importance purely descriptive?
> 
> Thanks,
> 
> -------------------------------------------------------------------
> Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
> -------------------------------------------------------------------
> Science kind of takes the fun out of the portent business.
> -Hobbes
> 

