[R] Strange R squared, possible error

Thu Mar 17 18:46:50 CET 2011

It is all a matter of what you are comparing to, that is, what the null model is.  In most cases (standard regression) we compare a model with slope and intercept to an intercept-only model, which isolates the effect of the slope.  The intercept-only model fits a horizontal line through the mean of the y's, hence the subtraction of the mean in the denominator.  If we don't subtract the mean, R-squared can easily become meaningless.  Here is an example where the model does contain an intercept (entered by hand as a column of ones), but the r-squared is computed using the no-intercept formula:

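x <- rnorm(100, 1000, 20)   # two independent samples, mean 1000, sd 20
y <- rnorm(100, 1000, 20)
cor(x,y)                    # should be close to 0

# The intercept is supplied as an explicit column of ones and the built-in
# intercept is removed with "+ 0", so R uses the no-intercept formula.
summary( lm( y ~ rep(1,100) + x + 0 ) )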

Notice how large the r-squared value is (nowhere near the square of the correlation), even though x and y were generated independently.
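
To see where the large value comes from, the two denominators can be computed by hand.  This is only a sketch: the seed and the names fit, rss, r2_uncentered and r2_centered are my own choices for illustration and are not part of the example above.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(100, 1000, 20)
y <- rnorm(100, 1000, 20)

# Same kind of fit as above: the intercept enters as an explicit column of
# ones, so lm() treats the model as having no intercept.
fit <- lm(y ~ rep(1, 100) + x + 0)
rss <- sum(residuals(fit)^2)     # residual sum of squares

r2_uncentered <- 1 - rss / sum(y^2)              # null model: mean 0
r2_centered   <- 1 - rss / sum((y - mean(y))^2)  # null model: mean of y

c(reported   = summary(fit)$r.squared,
  uncentered = r2_uncentered,
  centered   = r2_centered,
  cor_sq     = cor(x, y)^2)
```

The reported value should match the uncentered version, while the centered version should match the squared correlation, because the fitted line itself is the usual slope-and-intercept line.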

When you force the intercept to 0, you are using a different null model (mean 0).  Part of Thomas's point was that if we still subtract the mean in this case, the calculation of r-squared can give a negative number, which, as you pointed out, is meaningless.  The gist is that the mean-subtracted formula is the wrong one to use here, so R instead uses the formula without subtracting the mean when you don't fit an intercept.
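
A small sketch of the negative value, assuming the same kind of simulated data as before (the seed and the names fit0, rss0, r2_wrong and r2_used are made up for this illustration):

```r
set.seed(2)                       # arbitrary seed, only for reproducibility
x <- rnorm(100, 1000, 20)
y <- rnorm(100, 1000, 20)

fit0 <- lm(y ~ x + 0)             # regression through the origin
rss0 <- sum(residuals(fit0)^2)    # residual sum of squares

# Mean-subtracted denominator applied to a no-intercept fit
r2_wrong <- 1 - rss0 / sum((y - mean(y))^2)

# Denominator without the mean, matching the mean-0 null model
r2_used  <- 1 - rss0 / sum(y^2)

r2_wrong
r2_used
summary(fit0)$r.squared
```

Here the line through the origin fits worse than a horizontal line at mean(y), so the first value falls below 0; the second is the one summary() reports.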

The r-squared values are different because they use different denominators, and they are therefore not comparable.

R uses 2 different formulas/denominators because there is no single formula/denominator that makes general sense in both cases.
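
One way to look at both cases at once is to write r-squared as 1 - RSS(model)/RSS(null model) and change only the null model.  The following is a sketch under that reading; the helper rss() and the object names are hypothetical and only serve this illustration.

```r
set.seed(3)                        # arbitrary seed, only for reproducibility
x <- rnorm(100, 1000, 20)
y <- rnorm(100, 1000, 20)

rss <- function(fit) sum(residuals(fit)^2)   # hypothetical helper

with_int <- lm(y ~ x)              # slope and intercept
no_int   <- lm(y ~ x + 0)          # slope only, line forced through 0

null_mean <- lm(y ~ 1)             # null model: horizontal line at mean(y)
null_zero <- lm(y ~ 0)             # null model: horizontal line at 0

r2_with_int <- 1 - rss(with_int) / rss(null_mean)
r2_no_int   <- 1 - rss(no_int)   / rss(null_zero)

c(r2_with_int, summary(with_int)$r.squared)
c(r2_no_int,   summary(no_int)$r.squared)
```

Each pair should agree, which shows that the two formulas differ only in which null model sits in the denominator.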

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of derek
> Sent: Thursday, March 17, 2011 9:29 AM
> To: r-help at r-project.org
> Subject: Re: [R] Strange R squared, possible error
> 
> That's exactly what I would like to do. Any idea of a good text? I've
> consulted several texts, but none defined R^2 as
> R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2), still less explained why
> different formulas are used for similar models, or why R^2 should be
> closer to 1 for y=a*x+0 than for the general model y=a*x+b.
> 
> from manual:
> r.squared R^2, the ‘fraction of variance explained by the model’,
> R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2),
> where y* is the mean of y[i] "if there is an intercept" and zero
> otherwise.
> 
> I don't need an explanation of what R^2 does or how to interpret it,
> because I know what it means and how it is derived. I don't need to be
> told which model I should apply. So the answers from Thomas weren't
> helpful.
> 
> I don't claim it is wrong, otherwise it wouldn't be used, but I want
> to see the reason behind using two formulas.
> 
> Control questions:
> 1) Does the statement "if there is an intercept" include a zero
> intercept?
> 
> 2) If I use the model y = a*x+0, which formula for R^2 is used: the one
> with y* or the one without?
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/Strange-R-squared-possible-error-tp3382818p3384844.html
> Sent from the R help mailing list archive at Nabble.com.