[R] lm/model.matrix confusion (? bug)
Gavin Simpson
gavin.simpson at ucl.ac.uk
Wed Dec 12 16:33:05 CET 2007
On Wed, 2007-12-12 at 06:46 -0800, Mark Difford wrote:
> Hi Gavin, Berwin,
>
> Thanks for your detailed replies. I'll make a general reply, if you don't
> mind.
>
> To reiterate, my main point is that if model.matrix() can be used in this
> way, then lm() shouldn't add an intercept.
>
> >> ... lm(Gas ~ model.matrix(~ Insul/Temp - 1), data = whiteside) ....
>
> And the documentation for lm() indicates. albeit indirectly, that
> model.matrix can be used in this way. It calls for a formula, or something
> that can be coerced to one. And the following meets that criterion:
> as.formula(Gas ~ model.matrix (~ Insul/Temp-1, data=whiteside)), and this
> specifies no intercept.
This is the problem - it doesn't!
What is says is model.matrix() don't add an intercept to the returned
model matrix. R then tries to work out what to do with the returned
matrix that is now on the rhs of the formula. But note that at no point
have you gotten round the fact that it will add an intercept *after* the
formula is parsed and a -1 is not found. This is because model.matrix is
called again, internally in lm but not with your formula as an argument
but with a terms object and nowhere in this object is it specified not
to include an intercept.
To do this, lm would have to parse the formula for model.matrix(....)
and if it finds something work accordingly.
>
> On the question of why I want to mess about in such a labarynthine way.
> Well my email was largely expository. With a straight call to lm(), I
> wouldn't bother with model.matrix.
>
> So, it really was about getting at lm coefficients inside a function (when
> you have to get the terms &c. from somewhere else), and trying to understand
> properly how things work, and why they work the way they do, and even if
> they should work the way they do.
>
> For instance:---
>
> if (ols) {
> obj <- x[[1]]
> mt <- terms(obj)
> mf <- model.frame(obj)
> y <- model.response(mf)
> X <- model.matrix(mt, mf, contrasts = obj$contrasts)
> if (attr(mt, "intercept") == 1) ## This is my
> hack to overcome the double-intercept problem
> { olscf <- summary(lm(y ~ X))$coefficients }
> else {
> olscf <- summary(lm(y ~ X - 1))$coefficients
> }
> rownames(olscf) <- rownames(coef(obj))
In this instance, you don't want to be working with lm. You can use
lm.fit which returns an object with $coefficients, so I would guess you
need something like this (not tested)
if (ols) {
obj <- x[[1]]
mt <- terms(obj)
mf <- model.frame(obj)
y <- model.response(mf)
X <- model.matrix(mt, mf, contrasts = obj$contrasts)
fit <- lm.fit(X, y) ## store fit in case you need something else
olscf <- coef(fit) ## get coefficients using an extractor function
##if (attr(mt, "intercept") == 1) ## This is my hack to overcome the double-intercept problem
## { olscf <- summary(lm(y ~ X))$coefficients }
##else {
## olscf <- summary(lm(y ~ X - 1))$coefficients
##}
##rownames(olscf) <- rownames(coef(obj))
}
Note also, that even if you are using your hack, you don't need to do
summary(lm(y ~ X))$coefficients
as
coef(lm(y ~ X))
will do.
lm is mainly there for top-level work. If you need to do things like you
have above, use lm.fit or do the qr decomposition yourself.
Or try a different way to build the formula you need and work with lm;
Bill Venables has a nice piece in R News Vol 2 Issue 2 in the
Programmer's Niche section on something that might be of use if you want
to build a correct formula for your needs from the colnames of X and y?
All the best,
G
>
> Thanks again for your input.
>
> Regards,
> Mark.
>
>
>
>
>
> Berwin A Turlach wrote:
> >
> > G'day Mark,
> >
> > On Wed, 12 Dec 2007 02:05:54 -0800 (PST)
> > Mark Difford <mark_difford at yahoo.co.uk> wrote:
> >
> >> In order to get the same coefficients as we get from the following
> > [...]
> >> we need to do the following (if we use model.matrix to specify the
> >> model)
> >
> > By why would you want to do this?
> >
> >> ##
> >> summary ( lm(Gas ~ model.matrix(~ Insul/Temp - 1) - 1, data =
> >> whiteside) )
> >>
> >> That is, we need to take out "two intercepts." Is this "correct"?
> >
> > Yes.
> >
> >> Shouldn't lm check to see if an intercept has been requested as part
> >> of the model formula?
> >
> > No, it does not. In the Details section of lm's help page you will
> > find the following:
> >
> > A formula has an implied intercept term. To remove this use
> > either 'y ~ x - 1' or 'y ~ 0 + x'. See 'formula' for more details
> > of allowed formulae.
> >
> > Thus, except if you explicitly ask for a constant term not be included,
> > lm will add a constant term (a column of ones) additionally to what
> > ever you have specified on the right hand side of the formula.
> >
> >> If I do
> >> ##
> >> summary(lm(as.formula(Gas ~ model.matrix (~ Insul/Temp-1,
> >> data=whiteside)), data=whiteside))
> >>
> >> we get a strange model.
> >
> > Well, you get a model in which not all parameters are identifiable, and
> > a particular parameter that is not identifiable is estimated by NA. I
> > am not sure what is strange about this.
> >
> >> But the formula part of this model qualifies
> >> as a valid formula
> >> ##
> >> as.formula(Gas ~ model.matrix (~ Insul/Temp-1, data=whiteside))
> >
> > Debatable, the above command only shows that it can be coerced into a
> > valid formula. :)
> >
> >> just as if I were to write: lm(Gas ~ Insul/Temp - 1, data=whiteside)
> >>
> >> But we know that the _correct_ formula is the following
> >
> >> ##
> >> as.formula(Gas ~ model.matrix (~ Insul/Temp-1, data=whiteside) -1)
> >
> > Why is this formula any more correct than the other one? Both specify
> > exactly the same model. It is just that one does it in an
> > overparameterised way.
> >
> >> (Sorry, this is getting really long) --- So, my question/confusion
> >> comes down to wanting to know why lm() doesn't check to see if an
> >> intercept has been specified when the model has been specified using
> >> model.matrix.
> >
> > Because lm() is documented not to check this. If you do not want to
> > have an intercept in the model you have to specifically ask it for.
> >
> > Also, comparing the output of
> > summary( lm(Gas ~ Insul/Temp - 1, data = whiteside) )
> > and
> > summary( lm(Gas ~ Insul/Temp, data = whiteside ) )
> >
> > you can see that lm() does not check whether there is an implicit
> > intercept in the model. Compare the (Adjusted) R-squared values
> > returned; one case is using the formula for models with no intercept
> > the other one the formula for models with intercept. Similar story
> > with the reported F-statistics.
> >
> > Cheers,
> >
> > Berwin
> >
> > =========================== Full address =============================
> > Berwin A Turlach Tel.: +65 6515 4416 (secr)
> > Dept of Statistics and Applied Probability +65 6515 6650 (self)
> > Faculty of Science FAX : +65 6872 3919
> > National University of Singapore
> > 6 Science Drive 2, Blk S16, Level 7 e-mail: statba at nus.edu.sg
> > Singapore 117546 http://www.stat.nus.edu.sg/~statba
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
>
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson [t] +44 (0)20 7679 0522
ECRC, UCL Geography, [f] +44 (0)20 7679 0565
Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
More information about the R-help
mailing list