[R] R2 always increases as variables are added?

Jari Oksanen jari.oksanen at oulu.fi
Tue May 22 09:24:54 CEST 2007

李俊杰 <klijunjie <at> gmail.com> writes:

> Hi, Lynch,
> Thank you for your attention first.
> I am also not a statistician and have just taken several statistics classes.
> So it is natural for us to ask some questions that seem naive to statisticians.
> I am sorry that I cannot agree with your point that we must always include
> an intercept in our model, because if the true intercept is zero, the strategy
> of you or your textbook will have two losses. First, there will be an
> explanation problem. If the true intercept is zero and your estimate of it is
> not zero, the result of the regression is misleading. It may not be as
> serious as judging coefficients that are actually zero to be
> non-zero, but the misjudgement here is still a loss to some
> extent. Second, if the true intercept is zero, your strategy's predictive
> ability is often lower than that of strategies which do not always include
> an intercept.
I'm not a statistician, but I've seen much damage done with regression forced
through zero in my field (ecology). This technique is taught in many statistical
textbooks popular among ecologists. The key problem here is: how do you *know*
that the intercept is zero? Even in logically compelling cases it is very easy
to reach false certainty of a zero intercept. A typical case in ecology is where
people study the number of species against biomass, and argue that there *must*
be zero species when biomass = 0 (if there is nothing, then there is nothing).
Their conclusion is that you must fit a model with no intercept. Let's see a
typical example (and I'm so confident that I won't put any random number seed
for this):

mass <- runif(100, 10, 500) # typical range for plant biomass/m^2
spno <- rpois(100, 12) # Moderate number of species independent of mass
summary(lm(spno ~ mass - 1)) # WRONG!
summary(lm(spno ~ mass)) # More or less correct
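
As a rough sketch of what to look for in those two summaries, the fits can be compared side by side. The seed and the object names fit0 and fit1 are assumptions added here only to make the sketch reproducible; the example above deliberately runs without a seed and should give the same picture.

```r
set.seed(1)                   # assumption: seed added only for reproducibility
mass <- runif(100, 10, 500)   # same simulated biomass as above
spno <- rpois(100, 12)        # species number independent of mass

fit0 <- lm(spno ~ mass - 1)   # forced through the origin
fit1 <- lm(spno ~ mass)       # with intercept

# The no-intercept line has to climb from 0 to about 12 species,
# so it reports a positive slope although spno does not depend on mass.
coef(fit0)
coef(fit1)                    # intercept close to 12, slope close to 0

# Without an intercept, summary.lm() measures R^2 against sum(y^2)
# instead of variation around mean(y), so the two values are not comparable.
summary(fit0)$r.squared
summary(fit1)$r.squared

# The residual standard error shows which model actually fits better.
summary(fit0)$sigma
summary(fit1)$sigma
```

The forced fit should show a "significant" slope and a high R^2 together with a clearly larger residual standard error, which is why its R^2 should not be read as evidence of a good model.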

It is not sufficient to know that the value must be zero at a certain point; you
should also know how that point is scaled. It may make sense to say that spno =
0 at log(mass) = -Inf, but then it does not make sense to force the regression
through that point. In particular, when the zero point lies outside the data and
is reached only by extrapolation, it is dangerous to force the regression through
the origin. Further, if your x does not have a really natural scale, so that you
can replace x with x - constant (like x - mean(x)), then it hardly makes sense
to play with zero intercepts.
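
The point about shifting x can be sketched on the same kind of simulated data. This is only an illustration: the seed and the names massc, fit_raw, fit_ctr and fit_ctr0 are assumptions introduced here.

```r
set.seed(1)                        # assumption: seed added only for reproducibility
mass <- runif(100, 10, 500)
spno <- rpois(100, 12)
massc <- mass - mean(mass)         # x - constant: an equally valid scale for x

fit_raw  <- lm(spno ~ mass)        # intercept, original scale
fit_ctr  <- lm(spno ~ massc)       # intercept, centred scale
fit_ctr0 <- lm(spno ~ massc - 1)   # forced through zero on the centred scale

# With an intercept, shifting x changes neither the slope nor the fitted values.
all.equal(unname(coef(fit_raw)["mass"]), unname(coef(fit_ctr)["massc"]))
all.equal(fitted(fit_raw), fitted(fit_ctr))

# Without an intercept, the answer depends on where zero happens to be.
coef(lm(spno ~ mass - 1))          # zero at mass = 0
coef(fit_ctr0)                     # zero at mass = mean(mass)

# The centred no-intercept line predicts 0 species at average biomass,
# so its residuals are off by mean(spno) on average.
mean(residuals(fit_ctr0))
mean(spno)
```

A model whose conclusions change when an arbitrary constant is subtracted from x is telling you about the choice of origin, not about the data.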

There may be cases where forcing the regression through zero makes sense, but
they seem to be very rare: I have seldom seen one.

There is an exegetic text at
http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf which also touches on this issue
(page 3) and makes nice reading anyhow.

Cheers, Jari Oksanen
