[R] reference category for factor in regression

Tue Jan 20 08:23:08 CET 2009

G'day Jos,

On Mon, 19 Jan 2009 20:22:10 +0000
Jos Elkink <jos.elkink at ucd.ie> wrote:

> Here is a little bit more R code to show the problem:

Thanks for that, all becomes clear now. :)

> > str(AGE)
>  Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
> > table(LABOUR)
> LABOUR
>    0    1
>  692 1409
> > NONLABOUR <- 1 - LABOUR
> > m <- glm(NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE : LABOUR + AGE : NONLABOUR, family=binomial)
> > m
> 
> Call:  glm(formula = NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE:LABOUR +
>   AGE:NONLABOUR, family = binomial)
> 
> Coefficients:
>             LABOUR           NONLABOUR       LABOUR:AGE65+     LABOUR:AGE18-24
>           -0.35110            -0.30486            -0.11890            -0.66444
>    LABOUR:AGE25-34     LABOUR:AGE35-49     LABOUR:AGE50-64  NONLABOUR:AGE18-24
>           -0.23893            -0.15860                  NA            -0.65655
> NONLABOUR:AGE25-34  NONLABOUR:AGE35-49  NONLABOUR:AGE50-64
>           -0.72815             0.04951             0.17481
> 
> As you can see, 65+ is taken as reference category in the interaction
> with NONLABOUR, but not in the interaction with LABOUR.

The way R translates model formulae into design matrices can appear as
a somewhat mysterious process.  It is claimed that the best account on
how this is done can be found in the MASS book.  

Essentially, if A is a factor with r levels, then if A appears as a
main term in the model it will contribute X_A (the matrix with r
columns that contains the dummy variables corresponding to the levels
of A) to the design matrix.  If A appears in a two-way interaction term,
then the contribution to the design matrix is the matrix obtained by
taking each column of X_A and multiplying it elementwise with each
column of the matrix contributed by the other variable.  And so forth
for higher order interactions.

Now, in general adding X_A to the design matrix will lead to a design
matrix that does not have full rank.  Thus, typically X_A is first
multiplied by a contrast matrix, say, C_A and X_A C_A is actually added
to the design matrix.  The contrast matrix used depends on your
settings of options()$contrasts.  Likewise X_A C_A is typically used in
the construction of the interaction terms and not X_A.

But if you play around with "+0" (or "-1"), then things seem to become
more complicated.  E.g. if +0 is in the model, then for the first
factor that is added as a main term, X_A is put into the design matrix,
not X_A C_A.  But a second factor, say, B would contribute X_B C_B to
the design matrix since adding X_B would lead again to a design matrix
that does not have full column rank; so some reduction is done
immediately.

MASS does not really discuss the situation when you use two
(apparently) quantitative variables in your model and one factor and
the factor only appears in the interaction terms (well, at least not the
edition of MASS that I have currently in front of me, which is an older
one; my copy of the most recent edition is in a colleague's office),
but it appears plausible that the logic of how the design matrix is
constructed uses X_A in the AGE:LABOUR interaction and X_A C_A in the
AGE:NONLABOUR interaction.  Or there is a subtle error in logic in how
the design matrix is constructed. 

Of course due to the columns contributed by LABOUR and AGE:LABOUR, the
design matrix has no longer full rank.  This is noticed while the model
is fitted and, hence, the LABOUR:AGE50-64 parameter is not estimable
and a NA is returned.

BTW, if you use

glm(NOVOTE ~ LABOUR + NONLABOUR + AGE : LABOUR + AGE : NONLABOUR, family=binomial)

Then the interaction terms are as you expect (if I understand you
correctly).  But now, of course, the NONLABOUR parameter is not
estimable.

> I know glm(NOVOTE ~ LABOUR * AGE, family=binomial) would be a more
> conventional specification, but the above should be equivalent and
> should give me the coefficients and standard errors for the two groups
> (LABOUR and NONLABOUR) separately, rather than for the difference /
> interaction term).

If I understand you correctly, I would do the following to get a fitted
model that directly gives you the estimates and standard errors that you
are interested in:

1) Turn LABOUR into a factor to have R do all the work with creating
the design matrix:

R> FLABOUR <- factor(LABOUR, labels=c("NO", "YES"))

2) Use any of the following commands:

R> glm(NOVOTE ~ FLABOUR/AGE - 1, family=binomial)
R> glm(NOVOTE ~ FLABOUR/AGE + 0, family=binomial)
R> glm(NOVOTE ~ FLABOUR + FLABOUR:AGE - 1, family=binomial)
R> glm(NOVOTE ~ FLABOUR + FLABOUR:AGE + 0, family=binomial)

Somehow, I have a preference of using "-1" instead of "+0", but that
does not really matter; it also does not matter whether these appear at
the beginning or the end (or the middle of the formula).  The first two
forms are, of course, equivalent to the latter two due to the way "/"
is interpreted.  I prefer the first version (less typing), but the
third or fourth version might be clearer to others.

HTH.

Cheers,

	Berwin

=========================== Full address =============================
Berwin A Turlach                            Tel.: +65 6516 4416 (secr)
Dept of Statistics and Applied Probability        +65 6516 6650 (self)
Faculty of Science                          FAX : +65 6872 3919       
National University of Singapore     
6 Science Drive 2, Blk S16, Level 7          e-mail: statba at nus.edu.sg
Singapore 117546                    http://www.stat.nus.edu.sg/~statba