[R] postResample R² and lm() R²
Max Kuhn
mxkuhn at gmail.com
Wed Dec 12 02:37:48 CET 2007
On Dec 11, 2007 3:35 PM, Giovane <gufrgs at gmail.com> wrote:
>
> So here comes my doubt: why do I have a value of 67.52% for R² when
> creating the model (that is, the model explains 67.52% of the data), and
> when I use this same model on the same input data, why does postResample
> return a very different value for R²?
>
Let's get in the WayBack machine and return to 4 days ago when I said:
> As has been previously noted on this list, there are a number of
> formulas for R-squared. This function uses the square of the
> correlation between the observed and predicted. The next version of
> caret will offer a choice of formulas.
For your data:
> cor(prediction, input$TOTAL)^2
[1] 0.3300378
For R-squared, summary.lm uses
ans$r.squared <- mss/(mss + rss)
ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)
and for your data rdf = 31, df.int = 0 and n = 35.
In other words, the R-squared estimate from summary.lm adjusts for the
degrees of freedom and postResample does not.
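To see where those pieces come from, here is a toy sketch (made-up data,
not your data set); the no-intercept fit is only there so that df.int = 0,
matching the values above:

set.seed(1)
x <- rnorm(35)
y <- 2 * x + rnorm(35)
fit <- lm(y ~ x - 1)       # no intercept, so df.int = 0 in summary.lm

f   <- fitted(fit)
rss <- sum(residuals(fit)^2)
mss <- sum(f^2)            # with an intercept summary.lm uses sum((f - mean(f))^2)
n   <- length(y)
rdf <- fit$df.residual
df.int <- 0

mss / (mss + rss)                               # summary(fit)$r.squared
1 - (1 - mss/(mss + rss)) * ((n - df.int)/rdf)  # summary(fit)$adj.r.squared
cor(f, y)^2                                     # the squared correlation postResample uses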
Why doesn't it use the df? In ?postResample you would see
"Note that many models have more predictors (or parameters) than data
points, so the typical mean squared error denominator (n - p) does not
apply. Root mean squared error is calculated using sqrt(mean((pred -
obs)^2)). Also, R-squared is calculated as the square of the
correlation between the observed and predicted outcomes."
Since caret is useful for comparing different types of models, we use a
biased estimate of the root MSE since we would like to directly
compare the RMSE from different models (say a linear regression and a
support vector machine). Many of these models do not have an explicit
number of parameters, so we use
mse <- mean((pred - obs)^2)
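As a quick illustration with made-up pred/obs vectors (not your data), the
function boils down to the two lines of arithmetic above:

library(caret)

obs  <- c(1.2, 3.4, 2.2, 4.8, 3.1)
pred <- c(1.0, 3.0, 2.5, 4.1, 3.6)

postResample(pred, obs)       # RMSE and Rsquared
sqrt(mean((pred - obs)^2))    # same RMSE: no (n - p) in the denominator
cor(pred, obs)^2              # same Rsquared: squared correlation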
Max