[R] specifying model terms when using predict

Marc Schwartz marc_schwartz at comcast.net
Fri Jan 16 22:30:23 CET 2009


on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
> I've recently encountered an issue when trying to use the predict.glm
> function.
> 
>  
> 
> I've gotten into the habit of using the dataframe$variablename method of
> specifying terms in my model statements.  I thought this unambiguous
> notation would be acceptable in all situations but it seems models
> written this way are not accepted by the predict function.  Perhaps
> others have encountered this problem as well.

<snip>

The bottom line is "don't do that".  :-)

When the predict.*() functions look for the variable names, they use the
names as specified in the formula that was used in the initial creation
of the model object.

As per ?predict.glm:

Note

Variables are first looked for in newdata and then searched for in the
usual way (which will include the environment of the formula used in the
fit). A warning will be given if the variables found are not of the same
length as those in newdata if it was supplied.


As per your example, using:

 x <- 1:100

 y <- 2 * x

 orig.df <- data.frame(x1 = x, y1 = y)

 lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

 pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))


When predict.glm() tries to locate the variable "orig.df$x1" in the data
frame passed to 'newdata', it cannot be found. The name stored in the
model is "orig.df$x1", not "x1" as you used above.

Thus, since it cannot find that variable in 'newdata', it begins to look
elsewhere for a variable called "orig.df$x1". Guess what?  It finds it
in the global environment as a column of the original data frame 'orig.df'.
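
To see which name the model is actually looking for, you can inspect the
term labels stored in the fitted object. This is only a small sketch built
on your example; 'term_names' and 'new.df' are names I made up for
illustration:

```r
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

# The predictor is stored under the literal expression used in the formula
term_names <- attr(terms(lm1), "term.labels")
print(term_names)

# 'new.df' has a column called "x1", but the model is not looking for "x1"
new.df <- data.frame(x1 = 101:150)
print(term_names %in% names(new.df))
```

The first print() shows "orig.df$x1" and the second shows FALSE, which is
why the lookup falls through to the environment of the formula.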

Since that column has a length of 100 and the data frame that you passed
to newdata only has 50 rows, you get a warning:

Warning message:

'newdata' had 50 rows but variable(s) found have 100 rows
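
Note that this is a warning and not an error, so predict() still returns
a result. As a quick check (again just a sketch on your example, with the
warning suppressed so that the result can be inspected), you can look at
the length of what comes back:

```r
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

# Suppress the warning only to inspect the returned object
pred1 <- suppressWarnings(predict(lm1, newdata = data.frame(x1 = 101:150)))

# Compare with the 50 rows supplied in 'newdata'
print(length(pred1))
```

If the length matches the original data rather than 'newdata', the values
in 'newdata' were never used.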


There is a "method" to the madness and good reason why the modeling
functions and others that take a formula argument also have a 'data'
argument to specify the location of the variables to be used.
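
For your example, that would look something like the following, using
bare column names in the formula and letting 'data' and 'newdata' say
where to find them. The names 'lm2', 'new.df' and 'pred2' are mine and
only for illustration:

```r
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

# Bare column names in the formula; 'data' says where to find them
lm2 <- glm(y1 ~ x1, data = orig.df)

# 'newdata' must contain a column with the same name as in the formula
new.df <- data.frame(x1 = 101:150)
pred2 <- predict(lm2, newdata = new.df)

# One prediction per row of new.df
print(length(pred2))
```

Here predict() finds "x1" in 'newdata' first, so it never has to go
looking in the global environment.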

HTH,

Marc Schwartz



