[Rd] model.frame() call from inside a function (PR#3671)

Thu Aug 7 09:07:06 MEST 2003

jerome at hivnet.ubc.ca writes:

> R version: 1.7.1
> OS: Red Hat Linux 7.2
> 
> Hi all,
> 
> The formula object in model.frame() is not retrieved properly when 
> model.frame() is called from within a function and the "subset" argument 
> is supplied.
> 
> foo <- function(formula,data,subset=NULL)
> {
>   cat("\n*****Does formula[-3] == ~y ?**** TRUE *****\n")
>   print(formula[-3] == ~y)
> 
>   cat("\n*****Result of model.frame() using formula[-3]**** FAIL *****\n")
>   print(try(model.frame(formula[-3],data=data,subset=subset)))
> 
>   cat("\n*****Result of model.frame() using ~y**** WORKS *****\n")
>   print(try(model.frame(~y,data=data,subset=subset)))
> }
> dat <- data.frame(y=c(5,25))
> foo(y~1,dat)
> 
> Curiously, if the "subset" argument is removed from the call to 
> model.frame(), then the execution is successful in both cases.
> 
> In ?model.frame, one can read:
>      Variables in the formula, `subset' and in `...' are looked for
>      first in `data' and then in the environment of `formula': see the
>      help for `formula()' for further details.
> 
> However, replacing the line
>     subset <- eval(substitute(subset), data, env)
> by
>     subset <- eval(substitute(subset), data, environment())
> in model.frame.default() fixes this problem. I don't know if this 
> correction would create more problems in other cases. Perhaps there is a 
> better fix.

There is really nothing to fix, at least if you go by the rule that it
is only a bug if it behaves contrary to documentation:

There is no "subset" in the environment of "formula", nor in the
"data". If you put one there, the error goes away

> subset<-NULL
> foo(y~1,dat,subset=1)

*****Does formula[-3] == ~y ?**** TRUE *****
[1] TRUE

*****Result of model.frame() using formula[-3]**** FAIL *****
   y
1  5
2 25

*****Result of model.frame() using ~y**** WORKS *****
  y
1 5

However, notice that it is not the same subset. 

There's a whole area of similar nastiness grouped under the heading of
"nonstandard evaluation rules". The basic issue is that you will often
assume that the variables used for subsetting comes from the same
place as those in the model, e.g. in lm(fat~age,subset=sex=="male").

The problem is that it gets really awkward when a function wants to
compute the subset variable and combine it with a formula passed as an
argument. And it only gets worse when arguments can be both scalar and
vector, e.g.

plot(fat~age, col=as.numeric(sex))
function(mycolor="green") plot(fat~age, col=mycolor)

We have discussed changing this on several occasions, e.g. by
requiring that arguments that need to be evaluated in the formula
environment or the data frame should be either model formulas
themselves or quoted expressions. However, that would break S-PLUS
compatibility and also a large body of existing analysis code.

[[ I did discover yesterday (or maybe I was just reminded...) that we
even have nonstandard nonstandard evaluation rules in some places
(nls() seems to evaluate its model formula in the global environment
even if it is given explicitly within a function:

  f <- function() {
    g <- function(a,x) exp(-a*x)
    nls(y~g(a,x),start=list(a=.1))
  }
  x <- 1:10
  y <- exp(-.12*x)+rnorm(10,sd=.001)
  f()
  Error in eval(expr, envir, enclos) : couldn't find function "g"

Argh...]]
-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907