[R] Predicting and Plotting "hypothetical" values of factors

Gabor Grothendieck ggrothendieck at gmail.com
Fri Nov 18 07:42:00 CET 2005


On 11/17/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On 17 Nov 2005 18:00:45 +0100, Peter Dalgaard <p.dalgaard at biostat.ku.dk> wrote:
> > Paul Johnson <pauljohn at ku.edu> writes:
> >
> > > Last Friday, I noticed that it is difficult to work with regression
> > > models in which there are factors.  It is easier to do the old fashioned
> > > thing of coding up "dummy" variables with 0-1 values.  The predict
> > > function's newdata argument is not suited to insertion of hypothetical
> > > values for the factor, whereas it has no trouble with numeric variables.
> > >   For example, if one uses a factor as a predictor in a logistic
> > > regression, then it is tough to plot the S-shaped curve that describes
> > > the probabilities.
> > >
> > > Here is some code with comments describing the difficulties that we have
> > > found.  Are there simpler, less frustrating approaches?
> >
> > I think the point is that if you think a factor can take intermediate
> > values in between the groups (as in sex==1.245), then it is really a
> > numeric variable and should be treated as such in the model, whether
> > or not it has only two distinct values in your data set.
> >
> > It is not obvious to me what the "S shaped curve" means in a model
> > which really only specifies two probabilities.
> >
> >        -p
> >
> >
> > > -----------------------------
> > > # Paul Johnson  <pauljohn at ku.edu> 2005-11-17
> > > # factorFutzing-1.R
> > >
> > >
> > > myfac <- factor(c(1.1, 4.5, 1.1, 1.1, 4.5, 1.1, 4.5, 1.1))
> > >
> > > y <- c(0,1,0,1,0,0,1,0)
> > >
> > > mymod1 <-glm (y~myfac, family=binomial)
> > >
> > > p1 <- predict(mymod1, type="response")
> > >
> > > plot(myfac,p1)
> > >
> > > # Wait, that's not the picture I wanted. I want to see the
> > > # proverbial S-shaped curve!
> > > #
> > >
> > > # The contrast variable that R creates is coded 0 and 1.
> > > # I'd like to have a sequence from 0 to 1 (seq(0,1,length.out=10)) and
> > > # get predicted values for each.  That would give the S-shaped curve.
> > >
> > >
> > > # But the use of predict does not work because the factor has 2
> > > # values and the predict function won't take my new data:
> > >
> > > predict(mymod1, newdata=data.frame(myfac=seq(0,1,length.out=8)))
> > >
> > > # Error: variable 'myfac' was fitted with class "factor" but class
> > > "numeric" was supplied
> > > # In addition: Warning message:
> > > # variable 'myfac' is not a factor in: model.frame.default(Terms,
> > > newdata, na.action = na.action, xlev = object$xlevels)
> > >
> > > # Isn't there an easier way than this?
> > >
> > > c1 <- coef(mymod1)
> > >
> > > myseq <- seq(0,1, length.out=10)
> > >
> > > newdf <- data.frame(1, myseq)
> > >
> > > linearPredictor <- as.matrix(newdf) %*% c1
> > >
> > > p2 <- 1/ (1 + exp(-linearPredictor))
> > > # But there is a big disaster if you just try the obvious thing
> > > # of plotting with
> > > # lines(myseq,p2)
> > > # The line does not show up where you hope in the plot.
> > > # The plot "appears to" have the x-axis on the scale
> > > # 1.1 to 4.5, So in order to see the s-shaped curve, it appears we have
> > > # to re-scale. However, this is a big illusion.  To see what I mean,
> > > # do
> > >
> > > points(2,.4)
> > >
> > > # you expected to see the point appear at (2,.4), but in the plot
> > > # it appears at (4.5,.4).  Huh?
> > >
> > > # The actual values being plotted are the integer-valued levels that
> > > # R uses for factors, the numbers you get from as.numeric(myfac).
> > > # So it is only necessary to re-scale the sequence by adding one.
> > >
> > > myseq2 <- 1 +  myseq
> > >
> > > lines( myseq2, p2, type="l" )
> > >
> > > #Its not all that S-shaped, but you have to take what you can get! :)
>
>
> Perhaps still too complicated but if interp is
> an interpolation function and xx are the two levels and yy
> are the corresponding linear predictors (not the responses)
> then plot xx interpolated vs the inverse link of yy
> interpolated:
>
> # interpolation function
> interp <- function(x, length = 100) seq(x[1], x[2], length = length)
>
> # levels and predictions (on linear predictor scale) from levels
> xx <- c(1.1, 4.5)
> yy <- predict(mymod1, newdata = list(myfac = factor(levs)))
>
> plot(interp(xx), family(mymod1)$linkinv(interp(yy)))
>

Sorry, I just realized that I had changed the variable name levs
to xx before posting but not in all spots so here it is again:

# interpolation function
interp <- function(x, length = 100) seq(x[1], x[2], length = length)

# levels and predictions (on linear predictor scale) from levels
xx <- c(1.1, 4.5)
yy <- predict(mymod1, newdata = list(myfac = factor(xx)))

plot(interp(xx), family(mymod1)$linkinv(interp(yy)))




More information about the R-help mailing list