[R] formatting data for predict()

Andrew Miles rstuff.miles at gmail.com
Sun Sep 26 06:38:45 CEST 2010


I'm trying to get predicted probabilities out of a regression model,  
but am having trouble with the "newdata" option in the predict()  
function.  Suppose I have a model with two independent variables, like  
this:

y=rbinom(100, 1, .3)
x1=rbinom(100, 1, .5)
x2=rnorm(100, 3, 2)
mod=glm(y ~ x1 + x2, family=binomial)

I can then get the predicted probabilities for the two values of x1,  
holding x2 constant at 0 like this:

p2=predict(mod, type="response", newdata=as.data.frame(cbind(x1, x2=0)))
unique(p2)
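
Since x1 only takes the values 0 and 1, the same two probabilities could also be obtained from a two-row data frame. This is a minimal, self-contained sketch; the set.seed() call and the name nd are additions for illustration and are not part of the code above.

```r
# Simulate data as above; the seed is an assumption added for reproducibility
set.seed(1)
y  <- rbinom(100, 1, .3)
x1 <- rbinom(100, 1, .5)
x2 <- rnorm(100, 3, 2)
mod <- glm(y ~ x1 + x2, family = binomial)

# One row per value of x1, with x2 held at 0 (nd is a hypothetical name)
nd <- data.frame(x1 = c(0, 1), x2 = 0)
p  <- predict(mod, type = "response", newdata = nd)
p
```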

However, I am running regressions as part of a function I wrote, which  
feeds in the independent variables to the regression in matrix form,  
like this:

dat=cbind(x1, x2)
mod2=glm(y ~ dat, family=binomial)

The results are the same as in mod.  Yet I cannot figure out how to  
input information into the "newdata" option of predict() in order to  
generate the same predicted probabilities as above.  The same code as  
above does not work:

p2a=predict(mod2, type="response", newdata=as.data.frame(cbind(x1, x2=0)))
unique(p2a)

Nor does creating a data frame that has the names "datx1" and "datx2",
which is how the variables appear if you run summary() on mod2.
Looking at the model frame of mod2 shows that the fitted model stores
only two variables, the dependent variable y and one independent
variable called "dat".  It is as if my two variables x1 and x2 have
become two levels in a factor variable called "dat".

names(mod2$model)
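
A quick check of what mod2 stores for "dat" may help here. The sketch below (with a seed added as an assumption, and mf as a hypothetical name) suggests that "dat" is kept in the model frame as a single two-column numeric matrix rather than as a factor, which would explain the coefficient names "datx1" and "datx2".

```r
# Simulate data as above; the seed is an assumption added for reproducibility
set.seed(1)
y  <- rbinom(100, 1, .3)
x1 <- rbinom(100, 1, .5)
x2 <- rnorm(100, 3, 2)
dat  <- cbind(x1, x2)
mod2 <- glm(y ~ dat, family = binomial)

mf <- mod2$model      # the model frame stored with the fit
names(mf)             # "y" "dat"
is.factor(mf$dat)     # FALSE
is.matrix(mf$dat)     # TRUE
colnames(mf$dat)      # "x1" "x2"
names(coef(mod2))     # "(Intercept)" "datx1" "datx2"
```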

My question is this:  if I have a fitted model like mod2, how do I use
the "newdata" option in the predict function so that I can get the
predicted values I am after?  That is, how do I recreate a data frame
with one variable called "dat" that contains two levels which represent
my (modified) variables x1 and x2?
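
One possible approach, sketched on the assumption that predict() looks for a variable named "dat" in newdata: supply a data frame with a single matrix-valued column called "dat", wrapped in I() so that data.frame() does not split it into separate columns. The seed and the name nd are additions for illustration only.

```r
# Simulate data as above; the seed is an assumption added for reproducibility
set.seed(1)
y  <- rbinom(100, 1, .3)
x1 <- rbinom(100, 1, .5)
x2 <- rnorm(100, 3, 2)
mod  <- glm(y ~ x1 + x2, family = binomial)
dat  <- cbind(x1, x2)
mod2 <- glm(y ~ dat, family = binomial)

# newdata with one column "dat" holding a two-column matrix (x2 fixed at 0);
# I() keeps the matrix intact as a single data frame column
nd  <- data.frame(dat = I(cbind(x1, x2 = 0)))
p2a <- predict(mod2, type = "response", newdata = nd)

# Reference predictions from the model fitted with separate variables
p2 <- predict(mod, type = "response", newdata = data.frame(x1, x2 = 0))
all.equal(unname(p2a), unname(p2))
unique(p2a)
```

If newdata contains no variable named "dat", predict() appears to fall back to the original dat matrix found in the environment of the formula, which would explain why the earlier attempts return the ordinary fitted values without an error.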

Thanks in advance!

Andrew Miles
Department of Sociology
Duke University


