[R] Help with factor levels and reference level

Jim Lemon jim at bitwrit.com.au
Sat Jun 7 13:39:35 CEST 2014


On Fri, 6 Jun 2014 11:16:11 AM Nwinters wrote:
> I have a variable coded in Stata as follows:
> gen sat_pm25cat_=.
> replace sat_pm25cat_= 1 if (sat_pm25>=4 & sat_pm25<=7.1 & sat_pm25!=.)
> replace sat_pm25cat_= 2 if (sat_pm25>=7.1 & sat_pm25<=10)
> replace sat_pm25cat_= 3 if (sat_pm25>=10.1 & sat_pm25<=11.3)
> replace sat_pm25cat_= 4 if (sat_pm25>=11.4 & sat_pm25<=12.1)
> replace sat_pm25cat_= 5 if (sat_pm25>=12.2 & sat_pm25<=17.1)
> 
> gen satpm25catR= "A" if sat_pm25cat_==1
> replace satpm25catR= "B" if sat_pm25cat_==2
> replace satpm25catR= "C" if sat_pm25cat_==3
> replace satpm25catR= "D" if sat_pm25cat_==4
> replace satpm25catR= "E" if sat_pm25cat_==5
> 
> my model for R is:
> glm.PM25linB <- glm(leuk ~ satpm25catR + sex + ageR, data=leuk,
>   family=binomial, epsilon=1e-15, maxit=1000)
> 
> In the summary, satpm25catR is being reported as all levels:
> 
> 
> <http://r.789695.n4.nabble.com/file/n4691823/Screen_Shot_2014-06-06_at_2.png>
> 
> What I want is to make "A" the reference level. How do I do this?

Hi Nwinters,
I get what you want with this example:

leukdf<-
data.frame(leuk=sample(0:1,100,TRUE),sat_pm25=runif(100,0,17.1),
  sex=sample(c("M","F"),100,TRUE),ageR=sample(20:75,100,TRUE))
# start with an all-NA factor that already has the five levels A to E
leukdf$satpm25catR<-factor(rep(NA,100),levels=LETTERS[1:5])
leukdf$satpm25catR[leukdf$sat_pm25 < 7.1]<-"A"
leukdf$satpm25catR[leukdf$sat_pm25 >= 7.1 &
 leukdf$sat_pm25 < 10.1]<-"B"
leukdf$satpm25catR[leukdf$sat_pm25 >= 10.1 &
 leukdf$sat_pm25 < 11.3]<-"C"
leukdf$satpm25catR[leukdf$sat_pm25 >= 11.3 &
 leukdf$sat_pm25 < 12.1]<-"D"
leukdf$satpm25catR[leukdf$sat_pm25 >= 12.1 &
 leukdf$sat_pm25 < 17.1]<-"E"
summary(glm(leuk ~ satpm25catR + sex + ageR, data=leukdf,
 family=binomial, epsilon=1e-15, maxit=1000))

Call:
glm(formula = leuk ~ satpm25catR + sex + ageR, family = binomial, 
    data = leukdf, epsilon = 1e-15, maxit = 1000)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4813  -1.1798   0.7631   1.1347   1.5195  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.67565    0.87205   1.922   0.0547
satpm25catRB -0.52289    0.58578  -0.893   0.3721
satpm25catRC -0.79998    0.78405  -1.020   0.3076
satpm25catRD -0.36488    0.88162  -0.414   0.6790
satpm25catRE -0.65372    0.51461  -1.270   0.2040
sexM         -0.54063    0.42073  -1.285   0.1988
ageR         -0.02095    0.01455  -1.440   0.1500

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.59  on 99  degrees of freedom
Residual deviance: 133.74  on 93  degrees of freedom
AIC: 147.74

Number of Fisher Scoring iterations: 5
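
In the summary above there is no satpm25catRA row. With R's default treatment contrasts, the first level of a factor is taken as the reference and absorbed into the intercept. The following is only a minimal sketch of how one might check and set the reference level; the names dat and pm_cat are invented for illustration and are not from the original data.

```r
# Minimal sketch; 'dat' and 'pm_cat' are made-up names for illustration.
dat <- data.frame(pm_cat = c("C", "A", "E", "B", "D", "A"),
                  stringsAsFactors = FALSE)

# Converting a character column to a factor sorts the levels alphabetically,
# so "A" becomes the first (reference) level.
dat$pm_cat <- factor(dat$pm_cat)
levels(dat$pm_cat)

# relevel() moves the chosen level to the front explicitly.
dat$pm_cat <- relevel(dat$pm_cat, ref = "A")

# Rows are the levels, columns are the dummy variables a model would create.
# The reference level is the row of zeros and has no column of its own.
contrasts(dat$pm_cat)
```

If the variable in the model is a factor and levels() lists "A" first, glm() should report only B to E. If every level shows up, it is worth checking with str() whether the column really is a single factor.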

It may be a problem with the way you have calculated the categorical 
variable, as David noted. Also, if you haven't read a paper I had 
published a few years ago titled "On the perils of categorizing 
responses", you might want to have a look.
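
As a possible alternative to the repeated subset assignments in the example, the categories could be built in one step with cut(). This is a sketch under assumptions: the vector pm holds made-up values, and the breaks simply reuse the cut points from the example above, not necessarily the ones you want.

```r
# Hypothetical example values; breaks reuse the cut points from the example.
pm <- c(4.0, 7.1, 9.9, 10.1, 11.3, 12.0, 12.1, 17.0)

# right = FALSE gives intervals closed on the left, [low, high),
# matching the ">= low & < high" tests used in the example.
pm_cat <- cut(pm, breaks = c(-Inf, 7.1, 10.1, 11.3, 12.1, 17.1),
              labels = LETTERS[1:5], right = FALSE)

# Cross-check the counts; values outside the breaks would show up as NA.
table(pm_cat, useNA = "ifany")
```

Because the labels are given in order, "A" is the first level of the resulting factor and so is the reference level in a model. Note that with right = FALSE a value of exactly 17.1 would fall outside the last interval and become NA.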

Jim


