[R] logistic regression with nominal predictors

Ramón Casero Cañas 8-T at gmx.net
Tue Sep 13 23:17:28 CEST 2005

(Sorry for any obvious mistakes; I am quite a newbie with no Statistics background.)

My question is: what does logistic regression gain over odds ratios
when none of the input variables is continuous?

My experiment:
 Outcome: ordinal scale, ``quality'' (QUA=1,2,3)
 Predictors: ``segment'' (SEG) and ``stress'' (STR). SEG is on a
             nominal scale with 24 levels, and STR is dichotomous (0,1).

Treating the outcome as continuous, a two-way ANOVA with

aov(as.integer(QUA) ~ SEG * STR)

finds no evidence of an interaction between SEG and STR, while each of
them is significant on its own. This is the result that we would expect
from clinical knowledge.

I use

xtabs(~QUA+SEG, data=data2.df, subset=STR==0)
xtabs(~QUA+SEG, data=data2.df, subset=STR==1)

for the contingency tables. There are zero cells, and for some values of
SEG there is only one non-zero cell, i.e. some values of SEG determine
the outcome with certainty.
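
A minimal sketch of how such deterministic SEG levels could be spotted, on made-up data (toy.df, its level names and its counts are invented for illustration and are not the real data2.df):

```r
# Toy data; values are made up for illustration only.
toy.df <- data.frame(
  QUA = factor(c(1, 1, 2, 3, 1, 2, 2, 3, 3, 3, 3, 1), levels = 1:3),
  SEG = factor(c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "a")),
  STR = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Contingency table of QUA by SEG within the STR == 0 stratum
tab0 <- xtabs(~ QUA + SEG, data = toy.df, subset = STR == 0)

# Number of non-zero cells in each SEG column
nonzero <- colSums(tab0 > 0)

# SEG levels whose observations all fall into a single QUA category
deterministic <- names(nonzero)[nonzero == 1]

print(tab0)
print(deterministic)
```

In this toy table only level "c" has a single non-zero cell, so it is the only level flagged.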

So initially I was thinking of a proportional odds logistic regression
model, but according to Hosmer and Lemeshow [1], zero cells are
problematic. So I remove from the data table the values of SEG that
determine the outcome, and I pool QUA=2 and QUA=3, so that I now have a
dichotomous outcome (QUA = Good/Bad) and no zero cells.
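
The two preprocessing steps could be sketched as follows. The data are made up, the names toy.df and toy3.df are hypothetical, and treating QUA=1 as "Good" is an assumption (the post does not say which category is which):

```r
# Toy data; values are made up for illustration only.
toy.df <- data.frame(
  QUA = c(1, 2, 3, 1, 3, 2, 3, 3, 3),
  SEG = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"))
)

# Keep only the SEG levels observed with more than one QUA value
tab  <- xtabs(~ QUA + SEG, data = toy.df)
keep <- names(which(colSums(tab > 0) > 1))
toy3.df <- droplevels(subset(toy.df, SEG %in% keep))

# Pool QUA = 2 and QUA = 3 (assumption: QUA = 1 means "Good")
toy3.df$QUA <- factor(ifelse(toy3.df$QUA == 1, "Good", "Bad"),
                      levels = c("Good", "Bad"))

# Check that the reduced table has no zero cells
print(xtabs(~ QUA + SEG, data = toy3.df))
```

With a two-level factor as the response, glm(..., family=binomial) treats the first level as failure, so with this level order the model would describe the probability of "Bad".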

The following model doesn't find evidence of interaction

glm(QUA ~ STR * SEG, data=data3.df, family=binomial)

so I go for

glm(QUA ~ STR + SEG, data=data3.df, family=binomial)
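
One common way to check for interaction is a likelihood ratio test between the two nested models. The sketch below uses simulated data (sim.df and the effect sizes are invented) and is not necessarily how the conclusion above was reached:

```r
set.seed(1)
n <- 600

# Simulated predictors; three SEG levels instead of the real ones
sim.df <- data.frame(
  STR = factor(rbinom(n, 1, 0.5)),
  SEG = factor(sample(c("s1", "s2", "s3"), n, replace = TRUE))
)

# Simulate the outcome from an additive model on the logit scale
eta <- -0.5 + 0.4 * (sim.df$STR == "1") +
  0.8 * (sim.df$SEG == "s2") - 0.6 * (sim.df$SEG == "s3")
sim.df$QUA <- rbinom(n, 1, plogis(eta))

fit.int <- glm(QUA ~ STR * SEG, data = sim.df, family = binomial)
fit.add <- glm(QUA ~ STR + SEG, data = sim.df, family = binomial)

# Likelihood ratio test: does the interaction improve the fit?
lrt <- anova(fit.add, fit.int, test = "Chisq")
print(lrt)
```

The test statistic is the drop in residual deviance between the two fits, compared with a chi-squared distribution whose degrees of freedom equal the number of interaction terms.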

(I suppose that what glm does is to create design variables for SEG,
where 0 0 ... 0 is for the first value of SEG, 1 0 ... 0 for the second
value, 0 1 0 ... 0 for the third, etc).

              Estimate Std. Error   z value Pr(>|z|)
(Intercept) -1.085e+00  1.933e-01    -5.614 1.98e-08 ***
STR.L        2.112e-01  6.373e-02     3.314 0.000921 ***
SEGP2C.MI   -9.869e-01  3.286e-01    -3.004 0.002669 **
SEGP2C.AI   -1.306e+00  3.585e-01    -3.644 0.000269 ***
SEGP2C.AA   -1.743e+00  4.123e-01    -4.227 2.37e-05 ***
SEGP4C.ML   -5.657e-01  2.990e-01    -1.892 0.058485 .
SEGP4C.BL   -2.908e-16  2.734e-01 -1.06e-15 1.000000
SEGSAX.MS    1.092e-01  2.700e-01     0.405 0.685772
SEGSAX.MAS  -5.441e-16  2.734e-01 -1.99e-15 1.000000
SEGSAX.MA    7.130e-01  2.582e-01     2.761 0.005758 **
SEGSAX.ML    1.199e+00  2.565e-01     4.674 2.96e-06 ***
SEGSAX.MP    1.313e+00  2.570e-01     5.108 3.26e-07 ***
SEGSAX.MI    8.865e-01  2.569e-01     3.451 0.000558 ***
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3462.0  on 3123  degrees of freedom
Residual deviance: 3012.6  on 3101  degrees of freedom
AIC: 3058.6

Number of Fisher Scoring iterations: 6
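
The design variables that glm builds can be inspected with model.matrix(). The sketch below uses tiny made-up factors and assumes R's default contrast settings. It also shows what happens when a two-level factor is stored as an ordered factor, which is what a coefficient name ending in ".L" (as in STR.L above) suggests:

```r
# Tiny made-up factors, for illustration only
SEG <- factor(c("a", "b", "c", "a"))
STR <- factor(c(0, 1, 1, 0))

# Unordered factors: 0/1 indicator columns, first level as baseline
X <- model.matrix(~ STR + SEG)
print(X)

# Ordered factor: polynomial contrasts, column named with ".L"
STR.ord <- factor(c(0, 1, 1, 0), ordered = TRUE)
X.ord <- model.matrix(~ STR.ord + SEG)
print(X.ord)
```

For an unordered SEG the coding matches the 0 0 ... 0, 1 0 ... 0 description above. For an ordered two-level factor the single column takes the values -1/sqrt(2) and 1/sqrt(2) rather than 0 and 1.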

Even though some coefficients show no evidence of statistical
significance, the model requires them from a clinical point of view.

At this point, the question is how to interpret these results, and
what advantage they offer over odds ratios. From [1] I understand
that with one dichotomous and one continuous predictor, the model lets
you adjust for the continuous variable.
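
As a starting point for interpretation, each coefficient is a log odds ratio, so exponentiating it gives an odds ratio relative to the baseline level. The sketch below copies two rows from the summary above. The STR part assumes that STR is an ordered factor with R's default polynomial contrast, which the name STR.L suggests but the post does not confirm:

```r
# Estimate and standard error copied from the SEGSAX.MA row
est <- 0.7130
se  <- 0.2582

# Odds ratio versus the baseline SEG level, with a 95% Wald interval
or <- exp(est)
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)
print(round(c(OR = or, lower = ci[1], upper = ci[2]), 2))

# STR.L estimate copied from the summary. Under the assumed polynomial
# contrast the two levels are coded -1/sqrt(2) and 1/sqrt(2), so the
# log odds ratio for STR=1 versus STR=0 is sqrt(2) times the estimate.
str.L  <- 0.2112
or.str <- exp(sqrt(2) * str.L)
print(round(or.str, 2))
```

If STR were instead coded with a 0/1 indicator, its odds ratio would simply be the exponential of its coefficient.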

But when all predictors are dichotomous (because of the design
variables), I don't quite see the effect of adjustment. Wouldn't it be
better just to split the data into two groups (STR=0 and STR=1) and,
instead of using logistic regression, use odds ratios for each value of SEG?
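
One way to explore how the two approaches relate is to compute both on the same data. The sketch below uses simulated data (sim.df and the effect sizes are invented) and, for simplicity, compares the STR odds ratio within each SEG level rather than the SEG odds ratios within each STR group:

```r
set.seed(2)
n <- 900
sim.df <- data.frame(
  STR = factor(rbinom(n, 1, 0.5)),
  SEG = factor(sample(c("s1", "s2", "s3"), n, replace = TRUE))
)
eta <- -0.5 + 0.4 * (sim.df$STR == "1") + 0.8 * (sim.df$SEG == "s2")
sim.df$QUA <- rbinom(n, 1, plogis(eta))

# Odds ratio of QUA = 1 for STR = 1 versus STR = 0 from a 2x2 table
odds.ratio <- function(d) {
  tab <- table(d$STR, d$QUA)
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}
or.table <- sapply(split(sim.df, sim.df$SEG), odds.ratio)

# The same odds ratios recovered from the model with interaction
fit.int <- glm(QUA ~ STR * SEG, data = sim.df, family = binomial)
b <- coef(fit.int)
or.model <- exp(c(s1 = b[["STR1"]],
                  s2 = b[["STR1"]] + b[["STR1:SEGs2"]],
                  s3 = b[["STR1"]] + b[["STR1:SEGs3"]]))

# Additive model: one STR odds ratio, adjusted for SEG
fit.add <- glm(QUA ~ STR + SEG, data = sim.df, family = binomial)
or.common <- exp(coef(fit.add)[["STR1"]])

print(rbind(table = or.table, model = or.model))
print(or.common)
```

In this sketch the model with interaction reproduces the per-level table odds ratios, while the additive model replaces them with a single STR odds ratio that assumes the effect is the same in every SEG level.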



[1] D.W. Hosmer and S. Lemeshow. ``Applied Logistic Regression''.
John Wiley & Sons. 2000.

Ramón Casero Cañas

web:    http://www.robots.ox.ac.uk/~rcasero/
