[R] logistic regression with nominal predictors

Tue Sep 13 23:17:28 CEST 2005

(Sorry for any obvious mistakes; I am quite a newbie with no statistics
background.)

My question is: what does logistic regression gain over odds ratios
when none of the input variables is continuous?

My experiment:
 Outcome: ordinal scale, ``quality'' (QUA=1,2,3)
 Predictors: ``segment'' (SEG) and ``stress'' (STR). SEG is
             nominal with 24 levels, and STR is dichotomous (0,1).

Treating the outcome as continuous, a two-way ANOVA with

aov(as.integer(QUA) ~ SEG * STR)

doesn't find evidence of interaction between SEG and STR, and each is
significant on its own. This is the result that we would expect from
clinical knowledge.

I use

xtabs(~QUA+SEG, data=data2.df, subset=STR==0)
xtabs(~QUA+SEG, data=data2.df, subset=STR==1)

for the contingency tables. There are zero cells, and for some values of
SEG there is only one non-zero cell, i.e. some values of SEG determine
the outcome with certainty.
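
A minimal sketch of how such values of SEG could be picked out of the
table. The data frame toy.df and its values are made up for illustration
and only stand in for data2.df:

```r
# Toy data standing in for data2.df (made-up values, not the real study)
toy.df <- data.frame(
  QUA = factor(c(1, 1, 1, 1, 2, 3, 1, 2, 2, 3)),
  SEG = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c")),
  STR = rep(0, 10)
)

# Contingency table of QUA by SEG for the STR == 0 group
tab <- xtabs(~ QUA + SEG, data = toy.df, subset = STR == 0)
print(tab)

# Number of non-zero QUA cells for each level of SEG
nonzero <- colSums(tab > 0)

# Levels of SEG whose observations all fall in one QUA category
deterministic <- names(nonzero)[nonzero == 1]
print(deterministic)
```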

So initially I was thinking of a proportional odds logistic regression
model, but according to Hosmer and Lemeshow [1], zero cells are
problematic. So I removed the deterministic values of SEG from the data
table and pooled QUA=2 and QUA=3, so that I now have a dichotomous
outcome (QUA = Good/Bad) and no zero cells.
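
Roughly, the recoding could look like the sketch below. The toy data,
the name drop.levels, and the choice that QUA=1 maps to "Good" are all
assumptions made for illustration:

```r
# Toy data (made-up values); level "a" plays the role of a
# deterministic value of SEG
toy.df <- data.frame(
  QUA = c(1, 1, 1, 1, 2, 3, 1, 2, 2, 3),
  SEG = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))
)
drop.levels <- "a"  # hypothetical list of deterministic SEG values

# Remove the deterministic values and forget their unused factor levels
toy3.df <- subset(toy.df, !(SEG %in% drop.levels))
toy3.df$SEG <- droplevels(toy3.df$SEG)

# Pool QUA = 2 and QUA = 3 into a single category.
# Assumption: QUA = 1 is "Good"; with a factor response, glm() models
# the probability of the second level ("Bad" here).
toy3.df$QUA <- factor(ifelse(toy3.df$QUA == 1, "Good", "Bad"),
                      levels = c("Good", "Bad"))

# Check that no zero cells remain
pooled <- xtabs(~ QUA + SEG, data = toy3.df)
print(pooled)
```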

The following model doesn't find evidence of interaction

glm(QUA ~ STR * SEG, data=data3.df, family=binomial)

so I go for

glm(QUA ~ STR + SEG, data=data3.df, family=binomial)
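
One way to compare the two models formally is a likelihood ratio test
between them. The sketch below uses simulated data (sim.df, with three
made-up SEG levels and no true interaction), not my real data:

```r
# Simulated stand-in for data3.df (all values are made up)
set.seed(42)
n <- 600
sim.df <- data.frame(
  STR = factor(sample(c(0, 1), n, replace = TRUE)),
  SEG = factor(sample(c("s1", "s2", "s3"), n, replace = TRUE))
)

# Log odds built from main effects only, i.e. no interaction
seg.effect <- c(s1 = 0, s2 = 0.5, s3 = -0.5)
eta <- -0.5 + 0.4 * (sim.df$STR == "1") +
  seg.effect[as.character(sim.df$SEG)]
sim.df$QUA <- rbinom(n, 1, plogis(eta))

# Model with and without the STR:SEG interaction
fit.int <- glm(QUA ~ STR * SEG, data = sim.df, family = binomial)
fit.add <- glm(QUA ~ STR + SEG, data = sim.df, family = binomial)

# Likelihood ratio (deviance) test of the interaction terms
lrt <- anova(fit.add, fit.int, test = "Chisq")
print(lrt)
```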

(I suppose that what glm does is create design variables for SEG,
where 0 0 ... 0 is for the first value of SEG, 1 0 ... 0 for the second
value, 0 1 0 ... 0 for the third, etc.)
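
The design variables can be inspected with model.matrix(). A small
sketch with a made-up three-level factor follows. The second part is an
assumption about my own data: the label STR.L in the output below
suggests that STR is stored as an ordered factor, in which case R uses
polynomial contrasts for it rather than 0/1 coding.

```r
# Unordered factor: treatment (0/1) design variables by default
SEG <- factor(c("a", "b", "c", "a"))
X <- model.matrix(~ SEG)
print(X)  # rows for "a" are 1 0 0, "b" are 1 1 0, "c" are 1 0 1

# Ordered factor: polynomial contrasts, giving a ".L" (linear) column
STR.ord <- factor(c(0, 1, 0, 1), ordered = TRUE)
print(colnames(model.matrix(~ STR.ord)))

# For two levels the linear contrast is -1/sqrt(2) and +1/sqrt(2)
print(contrasts(STR.ord))
```

If STR really is coded this way, the two STR groups differ by sqrt(2)
times the STR.L coefficient on the log odds scale, not by the
coefficient itself.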

Coefficients:
              Estimate Std. Error   z value Pr(>|z|)
(Intercept) -1.085e+00  1.933e-01    -5.614 1.98e-08 ***
STR.L        2.112e-01  6.373e-02     3.314 0.000921 ***
SEGP2C.MI   -9.869e-01  3.286e-01    -3.004 0.002669 **
SEGP2C.AI   -1.306e+00  3.585e-01    -3.644 0.000269 ***
SEGP2C.AA   -1.743e+00  4.123e-01    -4.227 2.37e-05 ***
[shortened]
SEGP4C.ML   -5.657e-01  2.990e-01    -1.892 0.058485 .
SEGP4C.BL   -2.908e-16  2.734e-01 -1.06e-15 1.000000
SEGSAX.MS    1.092e-01  2.700e-01     0.405 0.685772
SEGSAX.MAS  -5.441e-16  2.734e-01 -1.99e-15 1.000000
SEGSAX.MA    7.130e-01  2.582e-01     2.761 0.005758 **
SEGSAX.ML    1.199e+00  2.565e-01     4.674 2.96e-06 ***
SEGSAX.MP    1.313e+00  2.570e-01     5.108 3.26e-07 ***
SEGSAX.MI    8.865e-01  2.569e-01     3.451 0.000558 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3462.0  on 3123  degrees of freedom
Residual deviance: 3012.6  on 3101  degrees of freedom
AIC: 3058.6

Number of Fisher Scoring iterations: 6

Even though there is no evidence that some of the coefficients are
statistically significant, the model requires them from a clinical
point of view.
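
As a sketch of one way to read the table, the SEG coefficients can be
exponentiated into odds ratios relative to the first (reference) value
of SEG, with Wald 95% intervals. The estimates and standard errors
below are copied from three rows of the output above; the variable
names are made up:

```r
# Estimates and standard errors copied from the glm output above
est <- c(SEGP2C.MI = -0.9869, SEGSAX.ML = 1.199, SEGSAX.MP = 1.313)
se  <- c(SEGP2C.MI = 0.3286,  SEGSAX.ML = 0.2565, SEGSAX.MP = 0.2570)

# Odds ratio of each SEG value against the reference value of SEG,
# for a fixed value of STR
or <- exp(est)

# Wald 95% confidence limits, computed on the log odds scale
z <- qnorm(0.975)
ci <- exp(cbind(lower = est - z * se, upper = est + z * se))

print(round(cbind(OR = or, ci), 2))
```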

At this point, the question would be how to interpret these results, and
what advantage they offer over odds ratios. From [1] I understand that
with one dichotomous and one continuous predictor, you can adjust for
the continuous variable.

But when all predictors are dichotomous (due to the design variables), I
don't quite see the effect of adjustment. Wouldn't it be better just to
split the data into two groups (STR=0 and STR=1), and instead of using
logistic regression, use odds ratios for each value of SEG?
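
To make the comparison concrete, here is a sketch of both approaches on
simulated data (sim.df and its three SEG levels are made up). The split
version gives one set of SEG odds ratios per STR group; the additive
model gives a single set, assumed to be the same in both groups:

```r
# Simulated stand-in for data3.df (all values are made up)
set.seed(7)
n <- 900
sim.df <- data.frame(
  STR = sample(c(0, 1), n, replace = TRUE),
  SEG = factor(sample(c("s1", "s2", "s3"), n, replace = TRUE))
)
seg.effect <- c(s1 = 0, s2 = 0.5, s3 = -0.5)
eta <- -0.5 + 0.4 * sim.df$STR + seg.effect[as.character(sim.df$SEG)]
sim.df$QUA <- rbinom(n, 1, plogis(eta))

# Odds ratio of each SEG value against the first one, from one table
or.vs.reference <- function(d) {
  tab <- table(d$SEG, d$QUA)        # rows: SEG values, columns: QUA 0/1
  odds <- tab[, "1"] / tab[, "0"]   # odds of QUA = 1 for each SEG value
  odds[-1] / odds[1]
}

# Approach 1: split by STR, one set of odds ratios per group
or.by.str <- lapply(split(sim.df, sim.df$STR), or.vs.reference)
print(or.by.str)

# Approach 2: additive logistic regression, one common set
fit.add <- glm(QUA ~ STR + SEG, data = sim.df, family = binomial)
or.adjusted <- exp(coef(fit.add)[c("SEGs2", "SEGs3")])
print(or.adjusted)
```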

Cheers,

Ramón.

[1] D.W. Hosmer and S. Lemeshow. ``Applied Logistic Regression''.
John Wiley & Sons. 2000.

-- 
Ramón Casero Cañas

web:    http://www.robots.ox.ac.uk/~rcasero/