[R] randomForest() for regression produces offset predictions
david at davidkatzconsulting.com
Fri Dec 21 00:37:52 CET 2007
I would expect this regression-towards-the-mean behavior on a new or hold-out
dataset, not on the training data. In RF terminology, this means that the
prediction returned by predict() is the in-bag estimate, but the out-of-bag
estimate is what you want for prediction. In Joshua's example,
rf.rf$predicted is an out-of-bag estimate, but since newdata is given, it
appears that the result is the in-bag estimate, which still needs an
adjustment like Joshua's (and perhaps a more complex one in some cases).
This is a bit confusing, since predict() usually matches what's in
model$fitted.values. I imagine that's why the author used "predicted" as
the component name instead of the standard "fitted.values".
The documentation for predict.randomForest explains:
"newdata - a data frame or matrix containing new data. (Note: If not given,
the out-of-bag prediction in object is returned.)"
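A minimal sketch of that distinction (assuming the randomForest package is
installed; the exact numbers depend on the random seed):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Infant.Mortality ~ ., data = swiss)

# Out-of-bag estimate: each case is predicted only by the trees
# that did not see it during training.
oob <- rf$predicted

# Giving the training data as newdata uses all trees, so every
# case is partly predicted by trees that were fit to it.
inbag <- predict(rf, newdata = swiss)

# The in-bag predictions hug the training responses far more
# closely than the honest out-of-bag ones.
cat("in-bag R^2:    ", cor(inbag, swiss$Infant.Mortality)^2, "\n")
cat("out-of-bag R^2:", cor(oob, swiss$Infant.Mortality)^2, "\n")
```

The gap between the two correlations is the in-sample optimism that the
out-of-bag mechanism is designed to remove.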
Patrick Burns wrote:
> What I see is the predictions being less extreme than the
> actual values -- predictions for large actual values are smaller
> than the actual, and predictions for small actual values are
> larger than the actual. That makes sense to me. The object
> is to maximize out-of-sample predictive power, not in-sample
> predictive power.
> Or am I missing something in what you are saying?
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> (home of S Poetry and "A Guide for the Unwilling S User")
> Joshua Knowles wrote:
>>I have observed that when using the randomForest package to do regression,
>>predicted values of the dependent variable given by a trained forest are
>>centred and have the wrong slope when plotted against the true values.
>>This means that the R^2 value obtained by squaring the Pearson correlation
>>is better than that obtained by computing the coefficient of determination
>>directly. The R^2 value obtained by squaring the Pearson can, however, be
>>exactly reproduced by the coeff. of det. if the predicted values are first
>>linearly transformed (using lm() to find the required intercept and slope).
>>Does anyone know why the randomForest behaves in this way - producing
>>offset predictions? Does anyone know a fix for the problem?
>>(By the way, the feature is there even if the original dependent variable
>>values are initially transformed to have zero mean and unit variance.)
>>As an example, here is some simple R code that uses the available swiss
>>dataset to show the effect I am observing.
>>Thanks for any help.
>>#### EXAMPLE OF RANDOM FOREST REGRESSION
>>library(randomForest)
>>#Build the random forest to predict Infant Mortality
>>rf.rf<-randomForest(Infant.Mortality ~ ., data=swiss)
>>#And predict the training set again
>>pred<-predict(rf.rf, newdata=swiss)
>>actual<-swiss$Infant.Mortality
>>#Plotting predicted against actual values shows the effect (uncomment to see)
>>#plot(pred, actual)
>># calculate R^2 as pearson coefficient squared
>>R2one<-cor(pred, actual)^2
>># calculate R^2 value as fraction of variance explained
>>R2two<-1-sum((actual-pred)^2)/sum((actual-mean(actual))^2)
>># now fit a line through the predicted and true values and
>># use this to normalize the data before calculating R^2
>>fit<-lm(actual ~ pred)
>>pred.adj<-fit$coefficients[1]+fit$coefficients[2]*pred
>>R2three<-1-sum((actual-pred.adj)^2)/sum((actual-mean(actual))^2)
>>cat("Pearson squared = ",R2one,"\n")
>>cat("Coeff of determination = ", R2two, "\n")
>>cat("Coeff of determination after linear fitting = ", R2three, "\n")
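Following David's point, the honest estimate is already stored in the fitted
object, so no linear re-fit is needed to get a realistic R^2. A sketch, again
assuming the randomForest package and the same swiss data:

```r
library(randomForest)

set.seed(1)
rf.rf <- randomForest(Infant.Mortality ~ ., data = swiss)
actual <- swiss$Infant.Mortality

# rf.rf$predicted holds the out-of-bag predictions
oob <- rf.rf$predicted
R2.oob <- 1 - sum((actual - oob)^2) / sum((actual - mean(actual))^2)

# Essentially the quantity print(rf.rf) reports as "% Var explained"
cat("OOB coeff of determination =", R2.oob, "\n")
```

This out-of-bag R^2 is the one to quote for expected predictive performance;
the in-bag values from predict() on the training data are optimistic.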
> R-help at r-project.org mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.
View this message in context: http://www.nabble.com/randomForest%28%29-for-regression-produces-offset-predictions-tp14415517p14447468.html
Sent from the R help mailing list archive at Nabble.com.