[Rd] model.frame(), model.matrix(), and derived predictor variables

Ben Bolker bbolker at gmail.com
Thu Aug 29 15:21:52 CEST 2013


On 13-08-28 05:43 PM, Gabriel Becker wrote:
> Ben,
> 
> It works for me ...
>> x = rpois(100, 5) + 1
>> y = rnorm(100, x)
>> d = data.frame(x,y)
>> m <- lm(y~log(x),d)
>> update(m,data=model.frame(m))
> 
> Call:
> lm(formula = y ~ log(x), data = model.frame(m))
> 
> Coefficients:
> (Intercept)       log(x) 
>      -4.010        5.817 
> 
> 

    That's because x and y are still lying around in your global
environment.  If you rm(x); rm(y) then it won't work any more.  And it
wouldn't have worked if you had constructed your data frame directly as

 d = data.frame(x=rpois(100,5)+1)
 d = transform(d,y=rnorm(100,x))
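
    A minimal sketch of that failure, for anyone who wants to reproduce
it.  The set.seed() call and the name 'res' are illustrative additions,
not part of the original example; the rm() line just makes sure no
stray x or y is left in the workspace.

```r
# Make sure no stray copies of x or y exist in the workspace
rm(list = intersect(c("x", "y"), ls()))

set.seed(101)  # illustrative seed, only for reproducibility

# Build the data so that x and y exist only as columns of d
d <- data.frame(x = rpois(100, 5) + 1)
d <- transform(d, y = rnorm(100, x))

m <- lm(y ~ log(x), data = d)

# The model frame has columns 'y' and 'log(x)' but no 'x', so
# re-evaluating the formula y ~ log(x) against it cannot find x
res <- tryCatch(update(m, data = model.frame(m)),
                error = function(e) e)

names(model.frame(m))
inherits(res, "error")
```

If the refit fails as described, 'res' holds the error condition rather
than a fitted model.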

> 
> You can also re-fit using the model.matrix directly. In your example,
> the model matrix can be passed directly to lm.fit / lm.wfit.

    Yes, if I want to refit the same model.  But if I want to do
something else with the model (e.g. try fitting vs. x instead of log(x),
or some other function of x) then it doesn't work.
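
    To make the distinction concrete, here is a rough sketch under some
assumptions.  The first part refits the same model from the stored
model matrix and response.  The second part is only one possible
workaround, not something settled in this thread: it assumes the
original data are still available at fit time, and uses
stats::get_all_vars() to keep a copy of the untransformed input
variables.  The names 'fit_same', 'raw' and 'm2' are hypothetical.

```r
set.seed(101)  # illustrative seed, only for reproducibility
d <- data.frame(x = rpois(100, 5) + 1)
d <- transform(d, y = rnorm(100, x))
m <- lm(y ~ log(x), data = d)

# (1) Refitting the *same* model needs only the stored pieces:
#     the design matrix and the response from the model frame
fit_same <- lm.fit(x = model.matrix(m),
                   y = model.response(model.frame(m)))
all.equal(unname(fit_same$coefficients), unname(coef(m)))

# (2) Fitting against x itself needs the raw input variable, which the
#     model frame does not contain.  get_all_vars() extracts the
#     untransformed variables named in the formula from the data.
raw <- get_all_vars(formula(m), data = d)
names(raw)

# With the raw variables stored, the formula can be changed freely
m2 <- update(m, . ~ x, data = raw)
coef(m2)
```

The cost is the one already mentioned: 'raw' is an extra copy of the
data that has to be carried along with the model.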

  cheers
    Ben
> 
> 
> ~G
> 
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
>  [7] LC_PAPER=C                 LC_NAME=C                
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C           
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base    
> 
> loaded via a namespace (and not attached):
> [1] tools_3.0.1
> 
> 
> 
> 
> On Sat, Aug 24, 2013 at 7:40 PM, Ben Bolker <bbolker at gmail.com> wrote:
> 
> 
>       Bump: just trying one more time to see if anyone had thoughts on this
>     (so far it's just <crickets> ...)
> 
> 
>     -------- Original Message --------
>     Subject: model.frame(), model.matrix(), and derived predictor variables
>     Date: Sat, 17 Aug 2013 12:19:58 -0400
>     From: Ben Bolker <bbolker at gmail.com>
>     To: R-devel at stat.math.ethz.ch
> 
> 
>       Dear r-developers:
> 
>       I am struggling with some fundamental aspects of model.frame().
> 
>       Conceptually, I think of a flow from data -> model.frame() ->
>     model.matrix; the data contain _input variables_, while model.matrix
>     contains _predictor variables_: data have been transformed, splines and
>     polynomials have been expanded into their corresponding
>     multi-dimensional bases, and factors have been expanded into appropriate
>     sets of dummy variables depending on their contrasts.
>       I originally thought of model.frame() as containing input variables as
>     well (but with only the variables needed by the model, and with cases
>     containing NAs handled according to the relevant na.action setting), but
>     that's not quite true.  While factors are retained as-is, splines and
>     polynomials and parameter transformations are evaluated. For example
> 
>     d <- data.frame(x=1:10,y=1:10)
>     model.frame(y~log(x),d)
> 
>     produces a model frame with columns 'y', 'log(x)' (not 'y', 'x').
> 
>     This makes it hard (impossible?) to use the model frame to re-evaluate
>     the existing formula in a model, e.g.
> 
>     m <- lm(y~log(x),d)
>     update(m,data=model.frame(m))
>     ## Error in eval(expr, envir, enclos) : object 'x' not found
> 
>     It seems to me that this is a reasonable thing to want to do
>     (i.e. use the model frame as a stored copy of the data that
>      can be used for additional model operations); otherwise, I
>     either need to carry along an additional copy of the data in
>     a slot, or hope that the model is still living in an environment
>     where it can find a copy of the original data.
> 
>     Does anyone have any insights into the original design choices,
>     or suggestions about how they have handled this within their own
>     code? Do you just add an additional data slot to the model?  I've
>     considered trying to write some kind of 'augmented' model frame, that
>     would contain the equivalent of
>     setdiff(all.vars(formula),names(model.frame(m))) [i.e.  all input variables
>     that appeared in the formula but not in the model frame ...].
> 
>       thanks
>        Ben Bolker
> 
> 
> 
> 
> 
> -- 
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis


