[Rd] model frames and update()

Thu Apr 23 22:24:36 CEST 2015

Therneau, Terry M., Ph.D. <therneau <at> mayo.edu> writes:

>  This issue has arisen within my anova.coxph routine, but is as
> easily illustrated with glm.

> testdata <- data.frame(y= 1:5,
>                         n= c(8,10,6,20,14),
>                         sex = c(0,1,0,1,1),
>                         age = c(30,20,35,25,40))
> 
> fit <- glm(cbind(y,n) ~ age + sex, binomial, data=testdata, model=TRUE)
> saveit <- fit$model
> 
> update(fit, .~. - age, data=saveit)
> Error in cbind(y, n) : object 'y' not found

> One would hope that a saved model frame is precisely the thing that
> would work best. The issue of course is that "cbind(y, n)" is the
> name of the first variable in saveit, and it is not being properly
> quoted somewhere down the line.  The same issue can occur on the
> right hand side.  "Save the model frame in case you need to refit
> something next month" is does not appear to be a safe approach to
> reproducable research.

> fit2 <- glm(y ~ sex + log(age), poisson, testdata)
> save2 <- fit2$model
> update(fit2, . ~ . - sex, data=save2)  # fails
> glm(y ~ log(age), poisson, save2)      # fails

> I can work around this in my anova, but I wanted to not rebuild the
> frame if it is already present.  It looks like model.matrix plus
> attr(x, 'assign') time -- a bit harder to read, but that looks like
> what anova.glm is doing.  Is there a way to make update work?

> The current code, BTW, starts by building its own frame using
> results of terms.inner, which solves the above issue nicely and
> update() works as expected.  But it isn't robust to scoping issues.
> (As pointed out yesterday by a user: lapply of a function that
> contained coxph followed by anova gives a variable not found error.)
> It also ignores saved model frames; thus the rewrite.  Terry T

I started to complain about this sort of thing last month, at

http://article.gmane.org/gmane.comp.lang.r.devel/37805

I was speaking of updating more generally (which might change
anything, not just the formula), but I complained that

   * we could try to use the model frame (which is stored already),
but there are issues with this (the basis of a whole separate rant)
because the model frame stores something in between predictor
variables and input variables. For example

   d <- data.frame(y=1:10,x=runif(10))
   names(model.frame(lm(y~log(x),data=d)))
   ## "y" "log(x)"

So if we wanted to do something like update to "y ~ sqrt(x)",
it wouldn't work ...

  A mini-version of that rant is that I find the model frame structure
quite awkward -- it contains neither what I would call _input_
variables (actual measured things that live in columns in the original
data frame) nor _predictor_ variables (columns associated with
parameters in the model, i.e. columns of the model matrix).  It seems
to me that life would be much easier if the model frame contained just
the intersection of all.vars(formula) with the column names of the
original data set (and with NA values dropped according to
na.action()). I appreciate that this is a potentially difficult design
problem with ramifications that I don't understand, but ... Martin
Maechler tried to explain the rationale for the design to me once, but
I didn't manage to understand his argument (so I have since forgotten
it).

On the other end, it would be nice (in some dream world) to have
the capability to associate a model with a *reference* to a model
frame; consider the situation where one is fitting 10 or 20 different
models to a large data set, ending up with many copies (I'm not
100% sure, but I think that using model.frame() will end up creating
an internal copy of the data even if it's not technically modified)
of the same gigantic data ...

  Ben Bolker