[R] R2 always increases as variables are added?

Junjie Li klijunjie at gmail.com
Tue May 22 06:08:45 CEST 2007


Hi, Lynch,

First, thank you for your attention.

I am not a statistician either and have only taken several statistics
classes, so it is natural for us to ask questions that seem naive to
statisticians.

I am sorry, but I cannot agree with your point that we must always include
an intercept in our model, because if the true intercept is zero, your
strategy (or your textbook's) incurs two losses. First, there is an
interpretation problem. If the true intercept is zero and your estimate of
it is not, the regression result is misleading. This may be less serious
than judging slope coefficients that are actually zero to be non-zero, but
the misjudgment is still a loss to some extent. Second, if the true
intercept is zero, your strategy's predictive ability is often lower than
that of strategies which do not always include an intercept.
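
To illustrate the second point, here is a minimal sketch (it is not the
attached code). It assumes a single predictor with a true slope of 2, a true
intercept of 0, unit-variance noise and 20 training points; one_run is a
hypothetical helper name, and the seed and the 2000 repetitions are arbitrary
choices. It compares the average out-of-sample squared error of lm(y ~ x)
and lm(y ~ x - 1).

```r
set.seed(1)  # assumption: arbitrary seed, only for reproducibility

# Hypothetical helper: simulate one training set and one test set from a
# model whose true intercept is 0, then return the test MSE of both fits.
one_run <- function(n = 20, beta = 2) {
  train <- data.frame(x = rnorm(n))
  train$y <- beta * train$x + rnorm(n)
  test <- data.frame(x = rnorm(n))
  test$y <- beta * test$x + rnorm(n)

  fit_with    <- lm(y ~ x,     data = train)  # intercept estimated
  fit_without <- lm(y ~ x - 1, data = train)  # intercept forced to 0

  c(with    = mean((test$y - predict(fit_with,    newdata = test))^2),
    without = mean((test$y - predict(fit_without, newdata = test))^2))
}

# Average the two test errors over many simulated data sets.
mse <- rowMeans(replicate(2000, one_run()))
print(mse)
```

Under these assumptions the "without" entry should usually come out slightly
smaller, since the fit with an intercept spends one degree of freedom
estimating a quantity that is actually zero.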

If you are interested in the performance of your strategy, i.e. maximizing
adjusted R^2 while always including an intercept, you can run the code I put
in the attachment.
It shows that maximizing adjusted R^2 without always including an intercept
beats maximizing adjusted R^2 while always including one.
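
One caveat about such comparisons: as far as I understand, summary() does
not use the same R^2 definition for both kinds of model. With an intercept
it measures against the centered total sum of squares sum((y-mean(y))^2);
without one it uses the uncentered sum(y^2). The sketch below checks this on
made-up data (the seed, sample size and coefficients are arbitrary
assumptions, and rss is a hypothetical helper) and also reproduces the
adjusted R^2 of both fits.

```r
set.seed(2)  # assumption: arbitrary seed and made-up data
n <- 10
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit_with    <- lm(y ~ x)
fit_without <- lm(y ~ x - 1)

# Hypothetical helper: residual sum of squares of a fitted model.
rss <- function(fit) sum(residuals(fit)^2)

# With an intercept: centered total sum of squares in the denominator.
r2_with <- 1 - rss(fit_with) / sum((y - mean(y))^2)
# Without an intercept: uncentered total sum of squares in the denominator.
r2_without <- 1 - rss(fit_without) / sum(y^2)

# Adjusted versions: the numerator degrees of freedom are n - 1 with an
# intercept and n without one.
adj_with    <- 1 - (1 - r2_with)    * (n - 1) / df.residual(fit_with)
adj_without <- 1 - (1 - r2_without) * n       / df.residual(fit_without)

# Compare the hand-computed values with what summary() reports.
print(all.equal(r2_with,     summary(fit_with)$r.squared))
print(all.equal(r2_without,  summary(fit_without)$r.squared))
print(all.equal(adj_with,    summary(fit_with)$adj.r.squared))
print(all.equal(adj_without, summary(fit_without)$adj.r.squared))
```

Because the two denominators differ, the R^2 and adjusted R^2 values that
summary() reports for models with and without an intercept are not on the
same scale.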

Junjie





2007/5/22, Paul Lynch <plynchnlm at gmail.com>:
>
> Junjie,
>    First, a disclaimer:  I am not a statistician, and have only taken
> one statistics class, but I just took it this Spring, so the concepts
> of linear regression are relatively fresh in my head and hopefully I
> will not be too inaccurate.
>    According to my statistics textbook, when selecting variables for
> a model, the intercept term is always present.  The "variables" under
> consideration do not include the constant "1" that multiplies the
> intercept term.  I don't think it makes sense to compare models with
> and without an intercept term.  (Also, I don't know what the point of
> using a model without an intercept term would be, but that is probably
> just my ignorance.)
>    Similarly, the formula you were using for R**2 seems to only be
> useful in the context of a standard linear regression (i.e., one that
> includes an intercept term).  As your example shows, it is easy to
> construct a "fit" (e.g. y = 10,000,000*x) so that SSR > SST if one is
> not deriving the fit from the regular linear regression process.
>          --Paul
>
> On 5/19/07, Junjie Li <klijunjie at gmail.com> wrote:
> > I know that "-1" removes the intercept term. But my question is why
> > the intercept term CANNOT be treated as a variable term, since we place
> > a column consisting of 1s in the predictor matrix.
> >
> > If I insist on comparing a model with an intercept and one without an
> > intercept on adjusted R2, I now think the strategy is to always use
> > another definition of r-square or adjusted r-square, in which
> > r-square=sum((y.hat)^2)/sum((y)^2).
> >
> > Am I on the right track?
> >
> > Thanks
> >
> > Li Junjie
> >
> >
> > 2007/5/19, Paul Lynch <plynchnlm at gmail.com>:
> > > In case you weren't aware, the meaning of the "-1" in y ~ x - 1 is to
> > > remove the intercept term that would otherwise be implied.
> > >     --Paul
> > >
> > > On 5/17/07, Junjie Li <klijunjie at gmail.com> wrote:
> > > > Hi, everybody,
> > > >
> > > > 3 questions about R-square:
> > > > ---------(1)----------- Does R2 always increase as variables are added?
> > > > ---------(2)----------- Can R2 be greater than 1?
> > > > ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> > > > calculated? It is different from
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > >
> > > > I will illustrate these problems with the following code:
> > > > ---------(1)-----------  R2 doesn't always increase as variables are added
> > > >
> > > > > x=matrix(rnorm(20),ncol=2)
> > > > > y=rnorm(10)
> > > > >
> > > > > lm=lm(y~1)
> > > > > y.hat=rep(1*lm$coefficients,length(y))
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 2.646815e-33
> > > > >
> > > > > lm=lm(y~x-1)
> > > > > y.hat=x%*%lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 0.4443356
> > > > >
> > > > > ################ This is the biggest model, but its R2 is not the biggest, why?
> > > > > lm=lm(y~x)
> > > > > y.hat=cbind(rep(1,length(y)),x)%*%lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 0.2704789
> > > >
> > > >
> > > > ---------(2)-----------  R2 can be greater than 1
> > > >
> > > > > x=rnorm(10)
> > > > > y=runif(10)
> > > > > lm=lm(y~x-1)
> > > > > y.hat=x*lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 3.513865
> > > >
> > > >
> > > > ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> > > > calculated? It is different from
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > > x=matrix(rnorm(20),ncol=2)
> > > > > xx=cbind(rep(1,10),x)
> > > > > y=x%*%c(1,2)+rnorm(10)
> > > > > ### r2 calculated by lm(y~x)
> > > > > lm=lm(y~x)
> > > > > summary(lm)$r.squared
> > > > [1] 0.9231062
> > > > > ### r2 calculated by lm(y~xx-1)
> > > > > lm=lm(y~xx-1)
> > > > > summary(lm)$r.squared
> > > > [1] 0.9365253
> > > > > ### r2 calculated by me
> > > > > y.hat=xx%*%lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 0.9231062
> > > >
> > > >
> > > > Thanks a lot for any clue :)
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Junjie Li,                  klijunjie at gmail.com
> > > > Undergraduate in DEP of Tsinghua University,
> > > >
> > > >
> > >
> > >
> > > --
> > > Paul Lynch
> > > Aquilent, Inc.
> > > National Library of Medicine (Contractor)
> > >
> >
> >
> >
> > --
> >
> > Junjie Li,                  klijunjie at gmail.com
> > Undergraduate in DEP of Tsinghua University,
>
>
> --
> Paul Lynch
> Aquilent, Inc.
> National Library of Medicine (Contractor)
>



-- 
Junjie Li,                  klijunjie at gmail.com
Undergraduate in DEP of Tsinghua University,

