[R] Question about randomForest

Wed Apr 4 20:18:43 CEST 2012

> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Saruman
> 
> I don't see how this answered the original question of the poster.
> 
> He was quite clear: the values of the predictions coming out of RF do not
> match what comes out of the predict function using the same RF object and
> the same data. Therefore, what is predict() doing that is different from RF?
> Yes, RF is making its predictions using OOB, but nowhere does it say what
> predict() is doing; indeed, it says if newdata is not given, then the
> results are just the OOB predictions. But if newdata = olddata, then
> predict(newdata) != OOB predictions. So what is it then?

Let me make this as clear as I possibly can:  If predict() is called without newdata, all it can do is assume that prediction on the training set is desired.  In that case it returns the OOB predictions, where each case is predicted using only the trees for which that case was out of bag.  If newdata is given in predict(), it assumes the data are "new" and thus makes predictions using all trees.  If you just feed the training data as newdata, then yes, you will get overfitted predictions, because most of the trees were grown on the very cases being predicted.  It almost never makes sense (to me anyway) to make predictions on the training set.
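
A minimal sketch of the difference, assuming the randomForest package is installed.  The simulated data and the names x, y, rf, pred_oob and pred_all are made up purely for illustration:

```r
# Sketch: OOB predictions vs. predictions with the training data as newdata.
# Assumes the randomForest package is installed; the data are simulated.
library(randomForest)

set.seed(1)
n <- 200
x <- data.frame(x1 = runif(n), x2 = runif(n))
y <- sin(2 * pi * x$x1) + rnorm(n, sd = 0.5)   # noisy response; x2 is pure noise

rf <- randomForest(x, y, ntree = 500)

# No newdata: predict() returns the out-of-bag (OOB) predictions,
# i.e. the same values that are stored in rf$predicted.
pred_oob <- predict(rf)

# Training data passed as newdata: all trees are used for every case,
# including the trees that were grown on that case.
pred_all <- predict(rf, newdata = x)

mse_oob <- mean((y - pred_oob)^2)
mse_all <- mean((y - pred_all)^2)

cat("MSE, OOB predictions:       ", mse_oob, "\n")
cat("MSE, training set as newdata:", mse_all, "\n")
```

On noisy data like this, the second error is typically much smaller than the first.  That gap is the overfitting described above, not evidence of better predictive accuracy.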

> This opens another issue: if newdata is close to, but not exactly,
> olddata, do you then get overfitted results?

Possibly, depending on how "close" the new data are to the training set.  This applies to nearly _ALL_ methods, not just RF.

Andy

> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Question-about-randomForest-tp4111311p4529770.html