[R] Behavior of ordered factors in glm

David Winsemius dwinsemius at comcast.net
Sun Jan 6 01:16:47 CET 2008


David Winsemius <dwinsemius at comcast.net> wrote in
news:Xns9A1CC05755274dNOTwinscomcast at 80.91.229.13: 

> I have a variable which is roughly age categories in decades. In the
> original data, it came in coded:
>> str(xxx)
> 'data.frame':   58271 obs. of  29 variables:
>  $ issuecat   : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1  1
>  1... 
> snip
> 
> I then defined issuecat as ordered:
>> xxx$issuecat<-as.ordered(xxx$issuecat)
> 
> When I include issuecat in a glm model, the result makes me think I 
> have asked R for a linear+quadratic+cubic+quartic polynomial fit.
> The results are not terribly surprising under that interpretation,
> but I was hoping for only a linear term (which I was taught to
> call a "test of trend"), at least as a starting point.
> 
>> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson")
>> summary(age.mdl)
> 
> Call:
> glm(formula = actual ~ issuecat, family = "poisson", data = xxx)
> 
> Deviance Residuals: 
>     Min       1Q   Median       3Q      Max  
> -0.3190  -0.2262  -0.1649  -0.1221   5.4776  
> 
> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)    
> (Intercept) -4.31321    0.04865 -88.665   <2e-16 ***
> issuecat.L   2.12717    0.13328  15.960   <2e-16 ***
> issuecat.Q  -0.06568    0.11842  -0.555    0.579    
> issuecat.C   0.08838    0.09737   0.908    0.364    
> issuecat^4  -0.02701    0.07786  -0.347    0.729 
> 
> This also means my advice to a another poster this morning may have 
> been misleading. I have tried puzzling out what I don't understand
> by looking at indices or searching in MASSv2, the Blue Book,
> Thompson's application of R to Agresti's text, and the FAQ, so far
> without success. What I would like to achieve is having the lowest
> age category be a reference category (with the intercept being the
> log-rate) and each succeeding age category  be incremented by 1. The
> linear estimate would be the log(risk-ratio) for increasing ages. I
> don't want the higher order polynomial estimates. Am I hoping for
> too much? 
> 

I acheived what I needed by:

> xxx$agecat<-as.numeric(xxx$issuecat)
> xxx$agecat<-xxx$agecat-1

The results look quite sensible:
> exp.mdl<-glm(actual~gendercat+agecat+smokecat, data=xxx, 
family="poisson", offset=expected)
> summary(exp.mdl)

Call:
glm(formula = actual ~ gendercat + agecat + smokecat, family = 
"poisson", 
    data = xxx, offset = expected)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5596  -0.2327  -0.1671  -0.1199   5.2386  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -5.89410    0.11009 -53.539  < 2e-16 ***
gendercatMale    0.29660    0.06426   4.615 3.92e-06 ***
agecat           0.66143    0.02958  22.360  < 2e-16 ***
smokecatSmoker   0.22178    0.07870   2.818  0.00483 ** 
smokecatUnknown  0.02378    0.08607   0.276  0.78233

I remain curious about how to correctly control ordered factors, or I 
should just simply avoid them.

-- 
David Winsemius




More information about the R-help mailing list