[R] randomForest() for regression produces offset predictions

David Katz david at davidkatzconsulting.com
Fri Dec 21 00:37:52 CET 2007

I would expect this regression-towards-the-mean behavior on a new or hold-out
dataset, not on the training data. In RF terminology, this means that calling
predict() with the training data as newdata returns the in-bag estimate, but
the out-of-bag estimate is what you want for assessing predictive accuracy.
In Joshua's example, rf.rf$predicted is an out-of-bag estimate, but since
newdata is given to predict(), it appears that the result is the in-bag
estimate, which still needs an adjustment like Joshua's (and perhaps a more
complex one might be needed in some cases). This is a bit confusing, since
predict() usually matches what's in model$fitted.values. I imagine that's why
the author used "predicted" as the component name instead of the standard
"fitted.values".

The documentation for predict.randomForest explains:

"newdata - a data frame or matrix containing new data. (Note: If not given,
the out-of-bag prediction in object is returned.)"
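
A minimal sketch of the distinction, assuming the randomForest package is
installed. The names rf, oob, inbag and mse are illustrative and not from the
thread, and the seed is arbitrary. Calling predict() without newdata should
return the stored out-of-bag predictions, while passing the training data as
newdata lets every tree vote on every row, including rows it was grown on.

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(1)  # arbitrary seed, only so that repeated runs agree
rf <- randomForest(Infant.Mortality ~ ., data = swiss)
actual <- swiss$Infant.Mortality

oob   <- predict(rf)                   # no newdata: out-of-bag predictions
inbag <- predict(rf, newdata = swiss)  # newdata given: all trees vote on each row

# The out-of-bag predictions are the ones stored in the fitted object.
stopifnot(isTRUE(all.equal(oob, rf$predicted)))

# Mean squared error of a set of predictions against the actual values.
mse <- function(p) mean((actual - p)^2)
cat("OOB MSE      =", mse(oob), "\n")
cat("In-bag MSE   =", mse(inbag), "\n")

# Slope of actual regressed on predicted; a slope of 1 would mean that no
# linear rescaling of the predictions is needed.
cat("OOB slope    =", coef(lm(actual ~ oob))[2], "\n")
cat("In-bag slope =", coef(lm(actual ~ inbag))[2], "\n")
```

On the training data the newdata predictions would normally show the smaller
error, which is why the out-of-bag figures are the fairer guide to performance
on new data.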

Patrick Burns wrote:
> What I see is the predictions being less extreme than the
> actual values -- predictions for large actual values are smaller
> than the actual, and predictions for small actual values are
> larger than the actual.  That makes sense to me.  The object
> is to maximize out-of-sample predictive power, not in-sample
> predictive power.
> Or am I missing something in what you are saying?
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
> Joshua Knowles wrote:
>>Hi all,
>>I have observed that when using the randomForest package to do regression,
>>predicted values of the dependent variable given by a trained forest are
>>centred and have the wrong slope when plotted against the true values.
>>This means that the R^2 values obtained by squaring the Pearson correlation
>>are better than those obtained by computing the coefficient of determination 
>>directly. The R^2 value obtained by squaring the Pearson can, however, be 
>>exactly reproduced by the coeff. of det. if the predicted values are first 
>>linearly transformed (using lm() to find the required intercept and slope).
>>Does anyone know why the randomForest behaves in this way - producing
>>offset predictions? Does anyone know a fix for the problem?
>>(By the way, the feature is there even if the original dependent variable 
>>values are initially transformed to have zero mean and unit variance.)
>>As an example, here is some simple R code that uses the available swiss 
>>dataset to show the effect I am observing.
>>Thanks for any help.
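>>library(randomForest)
>>#Build the random forest to predict Infant Mortality
>>rf.rf<-randomForest(Infant.Mortality ~ ., data=swiss)
>>#And predict the training set again
>>pred<-predict(rf.rf, newdata=swiss)   # assumed: line missing in the archive
>>actual<-swiss$Infant.Mortality        # assumed: line missing in the archive
>>#Plotting predicted against actual values shows the effect (uncomment to see)
>>#plot(pred, actual)                   # assumed: line missing in the archive
>># calculate R^2 as pearson coefficient squared
>>R2one<-cor(pred,actual)^2             # assumed: line missing in the archive
>># calculate R^2 value as fraction of variance explained
>>residOpt<-actual-pred                 # assumed: line missing in the archive
>>residnone<-actual-mean(actual)        # assumed: line missing in the archive
>>R2two<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>># now fit a line through the predicted and true values and
>># use this to normalize the data before calculating R^2
>>fit<-lm(actual ~ pred)
>>residOpt<-residuals(fit)              # assumed: line missing in the archive
>>R2three<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>>cat("Pearson squared = ",R2one,"\n")
>>cat("Coeff of determination = ", R2two, "\n")
>>cat("Coeff of determination after linear fitting = ", R2three, "\n")
>>## END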

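The relationship between the three R^2 figures in Joshua's quoted code can be
illustrated without a forest at all. The sketch below uses synthetic data in
base R; actual, pred, r2_var and the shrinkage factor 0.5 are hypothetical
choices, not taken from the thread. It assumes predictions that are shrunk
towards the mean, as Patrick describes, and uses the same var()-based ratio
as the quoted code.

```r
set.seed(42)  # arbitrary seed, only so that repeated runs agree

# Hypothetical data: 'pred' is deliberately less extreme than 'actual'
# (slope 0.5 around the mean) with a little added noise.
actual <- rnorm(200, mean = 20, sd = 3)
pred   <- mean(actual) + 0.5 * (actual - mean(actual)) + rnorm(200, sd = 0.5)

# Fraction of variance explained: 1 - var(residuals) / var(actual).
r2_var <- function(actual, fitted) 1 - var(actual - fitted) / var(actual)

r2_pearson <- cor(pred, actual)^2     # squared Pearson correlation
r2_direct  <- r2_var(actual, pred)    # predictions used as they are

# Linear recalibration: regress actual on pred, then score the fitted line.
fit <- lm(actual ~ pred)
r2_recal <- r2_var(actual, fitted(fit))

cat("Pearson squared                =", r2_pearson, "\n")
cat("Coeff of determination         =", r2_direct, "\n")
cat("Coeff of det. after linear fit =", r2_recal, "\n")
```

For a least-squares line with an intercept, the recalibrated figure equals the
squared Pearson correlation, and the direct figure cannot exceed it, because
the fitted line minimises the residual variance over all linear rescalings of
pred.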