[R] hypothetical prediction after polr

Dennis Murphy djmuser at gmail.com
Wed Oct 19 05:34:34 CEST 2011


Hi:

I think the problem is that you're trying to append the predicted
probabilities as a new variable in the (one-line) data frame, when in
fact a vector of probabilities is output ( = number of ordered levels
of the response) for each new observation. Here's a reproducible
example hacked from the faraway package that shows a few ways to deal
with the problem.

library("MASS")
library('faraway')

# Some messing around with the nes96 data to coarsen the factor levels
# of the response and to change income from an ordered factor to
# numeric by replacing factor levels with the midpoint incomes in
# each class. The result is still ordered.
nes96$sPID <- nes96$PID
levels(nes96$sPID) <- c('Dem', 'Dem', 'Ind', 'Ind', 'Ind', 'Rep', 'Rep')

incavg <- c(1.5,4,6,8,9.5,10.5,11.5,12.5,13.5,14.5,16,18.5,21,23.5,27.5,
          32.5,37.5,42.5,47.5,55,67.5,82.5,97.5,115)
nes96$nincome <- incavg[unclass(nes96$income)]

# Fit the model:
pomod <- polr(sPID ~ age + educ + nincome, data = nes96)

# Create a new data frame for prediction
ndat <- data.frame(age = 40, educ = 'BAdeg', nincome = mean(nincome))

# Predict: result is a vector rather than a scalar
predict(pomod, newdata = ndat, type = 'probs')
      Dem       Ind       Rep
0.3787940 0.2663598 0.3548461

# Fails because you're trying to append a new column of length three
# to a data frame with one row
ndat$predprob <- predict(pomod, newdata = ndat, type = 'probs')
Error in `$<-.data.frame`(`*tmp*`, "predprob", value = c(0.378794008534862,  :
  replacement has 3 rows, data has 1

# Better: cbind the predictions to the prediction data frame

# Method 1: for one new observation, cbind() creates a three-line data frame
cbind(ndat, predprob = predict(pomod, newdata = ndat, type = 'probs'))
    age  educ  nincome  predprob
Dem  40 BAdeg 46.57574 0.3787940
Ind  40 BAdeg 46.57574 0.2663598
Rep  40 BAdeg 46.57574 0.3548461

# Method 2: One-line data frame output for one-line newdata input
# Since predict() outputs a numeric vector,  first coerce it into a list
# and then to a one-line data frame. Then cbind() returns a one line
# data frame.
cbind(ndat,
      predprob = as.data.frame(as.list(predict(pomod, newdata = ndat,
type = 'probs'))))

#  age  educ  nincome predprob.Dem predprob.Ind predprob.Rep
#1  40 BAdeg 46.57574     0.378794    0.2663598    0.3548461

# Now try with multiple new observations:
ndat2 <- data.frame(age = c(35, 40), educ = c('HS', 'BAdeg'), nincome
= mean(nincome))
cbind(ndat2, predprob = predict(pomod, newdata = ndat2, type = 'probs'))

  age  educ  nincome predprob.Dem predprob.Ind predprob.Rep
1  35    HS 46.57574    0.4261833    0.2627280    0.3110888
2  40 BAdeg 46.57574    0.3787940    0.2663598    0.3548461

So method 2 is consistent with what you would get from predicting
multiple new observations.

HTH,
Dennis

On Tue, Oct 18, 2011 at 6:49 PM, Xu Jun <junxu.r at gmail.com> wrote:
> Dear R-Help listers,
>
> I am trying to estimate an proportional odds logistic regression model
> (or ordered logistic regression) and then make predictions by
> supplying a hypothetical x vector. However, somehow this does not
> work. I guess I must have missed something here. I first used the polr
> function in the MASS package, and I create a data frame and supply it
> to the predict function (see below):
>
> ###############################################################
> myologit <- polr(factor(warm) ~ yr89 + male + white + age + ed + prst,
>              data=ordwarm2, method=c("logistic"))
> yr89  <- c(1)
> male  <- c(1)
> white <- c(1)
>  age <- c(mean(ordwarm2$age))
>  ed  <- c(mean(ordwarm2$ed))
> prst  <- c(mean(ordwarm2$prst))
> prdata <- data.frame(yr89, male, white, age, ed, prst)
>
> prdata$rankR <-predict(myologit,newdata=prdata,type="probs")
> ################################################################
>
> I do not have any problem estimating the model, but when it comes to
> the last time (predict), I got the following message:
>
> Error in `$<-.data.frame`(`*tmp*`, "rankR", value = c(0.124294963178868,  :
>  replacement has 4 rows, data has 1
>
> Can anyone help me out with this error here? Thanks a lot!
>
> Jun Xu, Phd
> Assistant Professor
> Department of Sociology
> Ball State University
>
> P.S.: below is the detailed output from R
>
> ################################################################
>
> Call:
> polr(formula = factor(warm) ~ yr89 + male + white + age + ed +
>    prst, data = ordwarm2, method = c("logistic"))
>
> Coefficients:
>          Value Std. Error t value
> yr89   0.523912   0.079899   6.557
> male  -0.733309   0.078483  -9.344
> white -0.391140   0.118381  -3.304
> age   -0.021666   0.002469  -8.777
> ed     0.067176   0.015975   4.205
> prst   0.006072   0.003293   1.844
>
> Intercepts:
>    Value    Std. Error t value
> 1|2  -2.4654   0.2389   -10.3188
> 2|3  -0.6309   0.2333    -2.7042
> 3|4   1.2618   0.2340     5.3919
>
> Residual Deviance: 5689.825
> AIC: 5707.825
>
>> # predictions for hypothetical data points
>> # white male in year 89
>> yr89  <- c(1)
>> male  <- c(1)
>> white <- c(1)
>>   age <- c(mean(ordwarm2$age))
>>   ed  <- c(mean(ordwarm2$ed))
>> prst  <- c(mean(ordwarm2$prst))
>> prdata <- data.frame(yr89, male, white, age, ed, prst)
>>
>> prdata$rankR <-predict(myologit,newdata=prdata,type="probs")
> Error in `$<-.data.frame`(`*tmp*`, "rankR", value = c(0.124294963178868,  :
>  replacement has 4 rows, data has 1
>>
>> prdata
>  yr89 male white      age       ed     prst
> 1    1    1     1 44.93546 12.21805 39.58526
> ###############################################################3
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list