[R] [Fwd: Re: evaluation question]

Sun Jan 25 22:20:16 CET 2009

dear list,

below is an edited version of my response to an r user asking me for
explaining some issues related to r's evaluation rules.  i find the
problem interesting enough to be forwarded to the list, hopefully for
comments from whoever may want to extend or correct my explanations.

(i'd like to add that much as i'm happy to receive and answer offline
mails, questions related to r are best sent directly to the list, where
the real experts are.)

-------- Original Message --------
Subject: 	Re: evaluation question
Date: 	Sun, 25 Jan 2009 20:32:22 +0100

xxx wrote:

<snip>

> Someone sent in an example a few days ago showing that prac1 ( see
> below ) doesn't work. Then someone else sent two different
> ways of fixing it.
> I'm still slightly confused.

<snip>

>
>
> x<-1:10;
> y<-rnorm(10) + x;
>
> # THIS DOES NOT WORK
>
> prac1 <- function( model,wghts){
>   lm( model, weights = wghts)
> }
>
> prac1(model = y~x, wghts = rep(1, 10))

tfm:

" the variables are taken from 'environment(formula)', typically 
  the environment from which 'lm' is called. "

when lm is applied to a model, the variable names used to pass arguments
to lm (here, 'wghts') are looked up in the environment where the model
was defined.  here, you have two environments:

- the global one (say, e_g), where x, y, and prac1 are defined;
- the call-local one (say, e_p1), created when prac1 is applied.

there is a variable name 'wghts' in the latter, but none in the
former.  just before the call, environmentwise the situation is as follows:

e_g = { 'x':v1, 'y':v2, 'prac1':v3 }

where e_g contains three mappings (of those we are interested here), written here as <name>:<value>, none for
'wghts'.  (the v1, v2, v3 stand for the respective values, as in the
code above.)

when you apply prac1, you create a new, local environment:

e_p1 = { 'model':v4, 'wghts':v5 }

where v4 is a promise with the expression 'y~x' and evaluation
environment e_g (the caller's environment), and v5 is a promise with the
expression 'rep(1, 10)' and evaluation environment e_g.

when you call lm, things are a little bit more complicated.  after some
black magic is performed on the arguments in the lm call, weights are
extracted from the model using model.weights, and the lookup is
performed not in e_p1, but in e_g.

rm(list=ls()) # cleanup
x = 1:10
y = rnorm(10)+x

p1 = function(model, wghts)
    lm(model, weights=wghts)

p1(y~x, rep(1,10))
# (somewhat cryptic) error: no variable named 'wghts' found

wghts = rep(1,10)
p1(y~x, wghts)
# now works, e_g has a binding for 'wghts'
# passing wghts as an argument to p1 makes no difference

note, due to lazy evaluation, the following won't do:

rm(wghts) # cleanup

p1(y~x, wghts<-rep(1,10))
# wghts still not found in e_g

if you happen to generalize your p1 over the additional arguments to be
passed to lm, ugly surprizes await, too:

p2 = function(model, ...) {
    # some additional code
    lm(model, ...) }
p2(y~x, weights=rep(1,10))
# (rather cryptic) error

if you want to fit a model with different sets of weights, the following
won't do:

rm(wghts) # cleanup
lapply(
   list(rep(1,10), rep(c(0.5, 1.5), 5)), # alternative weight vectors
   function(weights) p1(y~x, weights))
# wghts not found in e_g, as before

but this, incidentally, will work:

rm(wghts) # cleanup
lapply(
   list(rep(1,10), rep(c(0.5, 1.5), 5)), 
   function(wghts) p1(y~x, wghts))
# wghts found in e_g, not in e_p1

as will this:

rm(wghts) # cleanup
lapply(
   list(rep(1,10), rep(c(0.5, 1.5), 5)), 
   function(wghts) p1(y~x))
# wghts found in e_g

but obviously not this:

rm(wghts) # cleanup
lapply(
   list(rep(1,10), rep(c(0.5, 1.5), 5)), 
   function(weights) p1(y~x))
# wghts not found

>
> # SOLUTION # 1
>
> prac2 <- function( model,wghts){
>  environment(model) <- environment()
>  lm(model,weights = wghts)
> }
>
> prac2(model = y~x, wghts = rep(1, 10))

environment() returns the local call environment (see e_p1 above), where
'wghts' is mapped to a promise to evaluate rep(1,10) in e_g.  you set
the environment of model to e_p1, so that lm looks for wghts there --
and finds it.

this is an 'elegant' workaround, with possible disastrous
consequences if the model happens to include a variable named 'model' or
'wghts':

model = 1:10
prac2(y~model, rep(1,10))
# can't use model in a formula?

wghts = x
prac2(y~wghts, rep(1,10))
# oops, not quite the same prac2(y~x, rep(1,10))

another problem with this 'elegant' 'solution' is that if prac_ happens to have
local variables with names in conflict with names in the model formula, you're
in trouble again:

prac2 = function(model, wghts) {
    environment(model) = environment()
    x = NULL # for whatever reason one might need an x here
    # whatever
    lm(model, weights = wghts) }

prac2(y~x, rep(1,10))
# oops, NULL is not good an x in the model

these may be unlikely scenarios, but the issue is serious.  you need to
understand the details of how lm is implemented in order to understand
why your (intuitively correct) example above did not work, to understand
why and when your variables can be captured in unexpected ways, and to
know how to write 'elegant' 'solutions' to avoid the problems.  

this also means that writing general purpose modules to be used by others is likely 
to cause errors (or silently produce rubbish) if you're not careful enough.
(or else you need to prepare your code to handle all exotic situations such as
passing weights to prac1 together with a model where one of the variables is
named 'model' or 'wghts' -- this is point 2 in a recent comment by Greg Snow,
which, as he says, stinks hubris.

>
> # SOLUTION # 2
>
> prac3 <- function( model,wghts){
>   cur.env <- environment()
>   lm( model, weights = wghts, data = cur.env )
> }
>
> prac3(model = y~x, wghts = rep(1, 10))
>
>

this is an equally 'good' 'solution', with the above comments equally
applicable here.  you'd have to tell the user of your modules which variable
names are persona non grata.

vQ