[R] Strange R squared, possible error

Gabor Grothendieck ggrothendieck at gmail.com
Thu Mar 17 16:36:38 CET 2011


On Wed, Mar 16, 2011 at 3:49 PM, derek <jan.kacaba at gmail.com> wrote:
> k=lm(y~x)
> summary(k)
> returns R^2=0.9994
>
> lm(y~x) is supposed to find coef. a and b in y=a*x+b
>
> l=lm(y~x+0)
> summary(l)
> returns R^2=0.9998
> lm(y~x+0) is supposed to find coef. a in y=a*x+b while setting b=0
>
> The question is why do I get better R^2, when it should be otherwise?
>
> I'm sorry to use the word "MS Excel" here, but I verified it in Excel and it
> gives:
> R^2=0.9994 when y=a*x+b is used
> R^2=0.99938 when y=a*x+0 is used
>

The idea is that if you have a positive quantity that can be broken
down into two nonnegative quantities, X = X1 + X2, then it makes sense
to ask what proportion X1 is of X.  For example: 10 = 6 + 4 and 6 is
0.6 of the total.

Now, in the case of a model with an intercept it's a mathematical fact
that the variance of the response equals the variance of the fitted
values plus the variance of the residuals.  Thus it makes sense to ask
what fraction of the variance of the response is represented by the
variance of the fitted values (this fraction is R^2).
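
As a quick check of that decomposition, here is a minimal sketch using the
built-in BOD data frame (the variable names v_response, v_fitted and
v_residuals are just illustrative):

```r
# Model with an intercept, fitted to the built-in BOD data frame.
fm <- lm(demand ~ Time, BOD)

v_response  <- var(BOD$demand)   # variance of the response
v_fitted    <- var(fitted(fm))   # variance of the fitted values
v_residuals <- var(resid(fm))    # variance of the residuals

# With an intercept, the two pieces add up to the whole.
decomposition_holds <- isTRUE(all.equal(v_response, v_fitted + v_residuals))

# R^2 as reported by summary() is the fitted share of the response variance.
r2_matches <- isTRUE(all.equal(summary(fm)$r.squared, v_fitted / v_response))

print(c(decomposition_holds = decomposition_holds, r2_matches = r2_matches))
```

Both checks should come out TRUE for any lm fit that includes an intercept.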

But if there is no intercept then that mathematical fact breaks down.
That is, it's no longer true that the variance of the response equals
the variance of the fitted values plus the variance of the residuals.
So how meaningful is it to ask what proportion the variance of the
fitted values is of the variance of the response in the first place?
In this case we need to rethink the entire approach, which is why R
uses a different formula for R^2 when the intercept is dropped: the
sums of squares are taken about zero rather than about the mean.  The
two R^2 values are therefore not directly comparable.
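
To see the breakdown concretely, the following sketch (again on BOD, with
illustrative variable names) checks the variance decomposition for the
no-intercept fit and then compares the R^2 reported by summary() with the
ratio of sums of squares taken about zero:

```r
# Model without an intercept, fitted to the built-in BOD data frame.
fm0 <- lm(demand ~ Time - 1, BOD)

y <- BOD$demand
f <- fitted(fm0)
e <- resid(fm0)

# The variances no longer add up, because the residuals
# need not average to zero when the intercept is dropped.
variance_adds_up <- isTRUE(all.equal(var(y), var(f) + var(e)))

# The sums of squares about zero still decompose exactly.
uncentered_adds_up <- isTRUE(all.equal(sum(y^2), sum(f^2) + sum(e^2)))

# summary() reports the uncentered ratio as R^2 for this model.
r2_uncentered <- sum(f^2) / sum(y^2)
r2_matches <- isTRUE(all.equal(summary(fm0)$r.squared, r2_uncentered))

print(c(variance_adds_up   = variance_adds_up,
        uncentered_adds_up = uncentered_adds_up,
        r2_matches         = r2_matches))
```

Because the denominator changes from a centered to an uncentered sum of
squares, a higher R^2 for the no-intercept model does not mean a better fit.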

Also, maybe the real problem is not this at all.  That is, perhaps you
are not really trying to measure goodness of fit but rather to compare
two particular models: one with an intercept and one without.  In that
case R^2 is not really what you want.  Instead use the R anova
command.  For example, using the built-in BOD data frame:

> fm <- lm(demand ~ Time, BOD)
> fm0 <- lm(demand ~ Time - 1, BOD)
> anova(fm, fm0)
Analysis of Variance Table

Model 1: demand ~ Time
Model 2: demand ~ Time - 1
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1      4  38.069
2      5 135.820 -1   -97.751 10.271 0.03275 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here we see that the residual sum of squares is much smaller for the
full model (38.069) than for the reduced model (135.820), and the
difference is significant at the 5% level (p = 0.03275).
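
For reference, the F statistic and p-value in that table can be reproduced
by hand from the two residual sums of squares.  This is only a sketch of
the standard nested-model F test; the variable names are illustrative:

```r
fm  <- lm(demand ~ Time, BOD)       # full model (with intercept)
fm0 <- lm(demand ~ Time - 1, BOD)   # reduced model (no intercept)

rss_full    <- deviance(fm)         # residual sum of squares, full model
rss_reduced <- deviance(fm0)        # residual sum of squares, reduced model
df_full     <- df.residual(fm)      # residual degrees of freedom, full model
df_reduced  <- df.residual(fm0)     # residual degrees of freedom, reduced model

# Extra sum of squares per extra parameter, scaled by the
# residual mean square of the full model.
df_diff <- df_reduced - df_full
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)

# Upper-tail probability of the F distribution.
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The values printed should agree with the F and Pr(>F) columns of the
anova table above.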

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


