[R] logistic regression model specification

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Wed Nov 14 01:58:52 CET 2007


Pardon me for intruding, but I recently had to explain something
analogous to this to a distinguished physician who had hurt himself
playing with survival models and was bleeding graphs and analyses all
over the place...

Dylan Beaudette wrote:
> On Tuesday 13 November 2007, Prof Brian Ripley wrote:
>> On Tue, 13 Nov 2007, Dylan Beaudette wrote:
>>> Hi,
>>>
>>> I have setup a simple logistic regression model with the glm() function,
>>> with the follow formula:
>>>
>>> y ~ a + b
>>>
>>> where:
>>> 'a' is a continuous variable stratified by
>>> the levels of 'b'
>>>
>>>
>>> Looking over the manual for model specification, it seems that
>>> coefficients for unordered factors are given 'against' the first level of
>>> that factor.
> 
> Thanks for the quick reply.
> 
>> Only for the default coding.
> 
> Indeed, I should have added that to my initial message.
> 
>>> This makes for difficult interpretation when using factor 'b' as a
>>> stratifying model term.
>> Really?  You realize that you have not 'stratified' on 'b', which would
>> need the model to be a*b?  What you have is a model with parallel linear
>> predictors for different levels of 'b', and if the coefficients are not
>> telling you what you want you should change the coding.
> 
> I should have specified that interpretation was difficult not because of the
> default behaviour, but rather because of my limitations and the nature of the
> data. Perhaps an example would help.
> 
> y ~ a + b
> 
> 'a' is a continuous predictor (i.e. temperature)
> observed on the levels of 'b' (geology)

Hmmm... are they fixed (as in "reproducible", or "having well-known
properties which could be expected to remain stable if the
observation/experiment is repeated") or random (i.e. labels for whatever
variation is thought to be attributable to local conditions)? In other
words, are you interested in the effect of a, with b being a nuisance (a
source of variability), or in the effects of b themselves?

What glm does presupposes that you are treating b as a fixed effect.
Treating it as a random effect would involve something like Douglas
Bates' lme4.
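For instance (a minimal sketch with simulated data; all names, levels
and numbers are made up for illustration):

set.seed(42)
d <- data.frame(a = runif(300, 5, 30),   # "temperature"
                b = factor(sample(c("basalt", "granite", "schist"),
                                  300, replace = TRUE)))   # "geology"
## simulate a binary outcome following the additive model :
d$y <- rbinom(300, 1,
              plogis(-3 + 0.15 * d$a +
                     c(basalt = 0, granite = 0.8,
                       schist = -0.5)[as.character(d$b)]))

## b as a fixed effect (what glm assumes) :
fit.fixed <- glm(y ~ a + b, family = binomial, data = d)

## b as a random effect (a random intercept per geology), with lme4
## (recent versions provide glmer() ; three levels of b is far too few
## in practice, this only shows the syntax) :
library(lme4)
fit.random <- glmer(y ~ a + (1 | b), family = binomial, data = d)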

Your reference to geology leaves the interpretation open... more on this
later.

> The form of the model (or at least what I was hoping for) would account for 
> the variation in 'y' as predicted by 'a', within each level of 'b' . Am I 
> specifying this model incorrectly?
> 
>> Much of what I am trying to get across is that you have a lot of choice as
>> to how you specify a model to R. There has to be a default, which is
>> chosen because it is often a good choice.  It does rely on factors being
>> coded well: the 'base level' (to quote ?contr.treatment) needs to be
>> interpretable.  And also bear in mind that the default choices of
>> statistical software in this area almost all differ (and R's differs from
>> S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good
>> choice' do differ.
> 
> Understood. I was not implying a level of 'goodness', rather hoping to gain 
> some insight into a (possibly) mis-coded model specification.
> 
>>> Setting up the model, minus the intercept term, gives me what appear to
>>> be more meaningful coefficients. However, I am not sure if I am
>>> interpreting the results from a linear model without an intercept term
>>> correctly.
>>> Model predictions from both specifications (with and without an intercept
>>> term) are nearly identical (different by about 1E-16 in probability
>>> space).
>>>
>>> Are there any gotchas to look out for when removing the intercept term
>>> from such a model?
>> It is just a different parametrization of the linear predictor.
>> Anything interpretable in terms of the predictions of the model will be
>> unchanged.  That is the crux: the default coefficients of 'b' will be
>> log odds-ratios that are directly interpretable, and those in the
>> per-group coding will be log-odds for a zero value of 'a'. Does a zero
>> value of 'a' make sense?
> 
> In the case of this experiment, a zero-level for 'a' does not make sense.

Then maybe re-scaling a into a', in a way that gives some useful
physical meaning to "a=0", might give you insights.

For example, let's say you are interested in the variations of a given
biological marker with body temperature. Whether you express the latter
in °C, K or °F, your statistical inferences about the phenomenon will
have the same validity. However, expressing it in (°C - 37) gives the
coefficients a very simple, interesting and useful meaning: they
express the behaviour of your marker at physiologically normal body
temperature. This reference is indeed useful.

Bill Venables expressed this idea (and other very important ones) in a
nice conference paper entitled "Exegeses on Linear Models", whose
reference I've lost (again!). In this paper, he suggests that centering
around the median is often useful (i.e. a' <- a - median(a)).

> Further thoughts welcomed.

Well... a bit on the parametrisation: if you use R's default of
treatment contrasts ("contr.treatment"), your coefficients express
differences between the level named and the reference level (the one
for which you don't find a coefficient in summary(glm(...))). This is
easy to interpret.
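And with treatment contrasts you can pick the reference level yourself,
e.g. (still the simulated d; "schist" is an arbitrary choice):

d2 <- transform(d, b = relevel(b, ref = "schist"))
fit.t <- glm(y ~ a + b, family = binomial, data = d2)
exp(coef(fit.t))   # for the b terms : odds-ratios of each geology
                   # vs "schist", at any fixed temperature a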

If you use, for example, "contr.helmert" (which was (still is?) the
default in S-plus), your coefficients are something totally different:
they are chosen so that each column of the contrast matrix sums to
zero. This choice is popular with some, because such contrasts allow
direct calculation of the so-called "Type III sums of squares" revered
by the FDA and despised by Bill Venables and Brian Ripley. Other
choices are possible and might be useful in different contexts.
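To see for yourself what a coding does (a sketch; the fitted model is
identical, only the parametrisation changes):

fit.h <- glm(y ~ a + b, family = binomial, data = d,
             contrasts = list(b = "contr.helmert"))
coef(fit.h)   # same fitted probabilities as fit.fixed,
              # very different-looking coefficients
## or globally, S-plus style :
## options(contrasts = c("contr.helmert", "contr.poly"))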

Now for "stratification".

You wrote your model as "y~a+b". This can be translated as: "the
(log-odds of the) effect fluctuates around a central value which is the
sum of a grand mean (the intercept), a factor specific to the particular
geology, and a quantity proportional to temperature, this last
proportionality factor being the same for *all* geologies".

In other words, you get tools allowing you to predict y in various
circumstances:
- an intercept (a "grand mean"), representing the *expectation* for y
  when a=0 and b is neglected;
- a set of corrections for each b (minus the reference, as explained
  before), allowing you a "better" expectation, still for a=0;
- a coefficient allowing you to refine your expectation once the
  temperature a is known.
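Concretely (still the simulated d; "basalt" happens to be the reference
level, being first alphabetically):

fit.add <- glm(y ~ a + b, family = binomial, data = d)
coef(summary(fit.add))
## (Intercept)       : log-odds of y at a = 0, for the reference geology
## a                 : the common slope, identical for *all* geologies
## bgranite, bschist : intercept shifts of the other geologies
##                     relative to the reference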


If you wanted to express the idea that this proportionality factor
might vary between geologies, you would write "y~a*b", which (as MASS
explains in detail) is equivalent to writing "y~a+b+a:b", and can be
translated (in poor English) as: "the (log-odds of the) effect
fluctuates around a central value which is the sum of a grand mean (the
intercept), a factor specific to the particular geology, a quantity
proportional to temperature (the same for *all* geologies), AND a
correction to this proportionality factor, specific to each geology,
times temperature".

This will give you:
- a grand mean, as before, for a=0 and b neglected;
- a set of differences between each b and the reference b (for a=0, of
  course);
- a "gross slope" of the line passing "at best" through the cloud of
  your observations in an (a, y) graph, allowing you to refine your
  prediction when a is known but b is ignored;
- and a correction to this slope for each b.
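In R (a sketch, reusing the simulated d):

fit.int <- glm(y ~ a * b, family = binomial, data = d)
coef(summary(fit.int))   # a:bgranite, a:bschist are the slope corrections
anova(fit.add, fit.int, test = "Chisq")   # does the slope really vary with b ?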

This is maybe not the easiest way to express this relationship. You
might be more interested in each particular slope than in a "grand
slope". This can be expressed by "y~b+(a %in% b)", which is equivalent
to "y~b+a:b", and gives you directly a grand mean, a correction for
each group, and one slope per group.
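For example (same simulated data; same fit as y ~ a*b, different
parametrisation):

fit.nest <- glm(y ~ b + a %in% b, family = binomial, data = d)
all.equal(fitted(fit.nest), fitted(fit.int))   # TRUE, up to numerical noise
coef(fit.nest)   # intercept, group corrections, then one slope *per* geology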

You could also have no interest in the "grand mean" but only in each
particular mean: "y~a*b-1", where the "-1" term suppresses the
intercept (the "grand mean"), which is therefore absorbed into each b
level, thus giving directly the group means for a=0.
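Which looks like:

fit.cell <- glm(y ~ a * b - 1, family = binomial, data = d)
coef(fit.cell)   # one intercept (log-odds at a = 0) per geology, plus a
                 # common slope and per-geology slope corrections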

You might even be interested in getting just a mean and a slope for
each group: "y~b+a:b-1" (equivalently, "y~b/a-1"), which combines the
two modifications made before.
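That is:

fit.sep <- glm(y ~ b + a:b - 1, family = binomial, data = d)
coef(fit.sep)   # three intercepts and three slopes, one pair per geology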

But note that if you need to test for an effect of a or an effect of
b, the models where you suppress the "grand mean" or "grand slope" will
give you difficulties: the hypotheses you will be testing will be
different from what is usually tested. This is well explained in MASS 4
(with more details in the "Linear Statistical Models" chapter).

Please also note that I have abused the meaning of "prediction": being
able to predict an expectation is not sufficient, and predict.glm() and
its relatives are there for a good reason...
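For example (a hypothetical new observation; predict.glm() gives
standard errors on the link scale):

new <- data.frame(a = 20, b = "granite")
predict(fit.int, newdata = new, type = "response")   # point prediction
p <- predict(fit.int, newdata = new, type = "link", se.fit = TRUE)
plogis(p$fit + c(-1.96, 1.96) * p$se.fit)   # rough 95 % interval,
                                            # on the probability scale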

*******

A word about "random effects":

I have supposed until now that you are indeed interested in the effect
of b ("geology") on y because b is some reproducible feature of your
terrain (e.g. a particular level of b might be "Hercynian fold"), and
you expect to extract information that will remain valid when going to
another site with the same characteristics (e.g. learning something in
the Alps and using this knowledge in the Pyrenees).

Now it could be something entirely different. To make myself clear, I
will return to biology.

Suppose that I am interested in the relation between body temperature
and a biological marker. I might want to know whether this relationship
differs according to whether my patient is male or female: this is a
characteristic I can easily observe if I repeat the experiment with
another subject (or I should change professions pronto...), and I can
expect that a "sex effect" observed on a random sample of a given
population will be observed again when extracting another sample from
the same population.

Now, I might also observe that the relation between my marker and body
temperature varies with "familial characteristics", i.e.:
1) this relation varies from person to person, but
2) this variation is smaller between people coming from the same family
than between unrelated people.

The point is that this characteristic (the "family") cannot be
*observed* (or *measured*) directly; it is but a *label*. Therefore, if
I observe this effect in a set of families chosen at random from a
given population, I cannot expect to use this knowledge when studying
another set of families extracted from the same population.

In other words, all I know is that the variability (the "distance", so
to speak) between unrelated people is larger than the variability
between related people.

It is rather tempting to decompose this variability into two parts,
"between families" and "within families", thus "explaining away" part
of this variability and getting better estimates of the relationship of
interest (i.e. the temperature-marker relationship). This is similar,
but in no way identical, to the separation between the effects of
temperature and the effects of sex one would make (ANCOVA) in the
previous example: the "family" effect is random, you cannot observe it
directly, and you have to estimate the variability it adds to the main
effect.

That is what "mixed models" aim to do. The way to separate fixed effects
(in pour case the temperature-larker relationship) from random
effects("between families" variation) from residual variability
(exemplified by the "within families" variability) is somewhat intricate.

I recommend reading Pinheiro and Bates (2000), referenced in the
"Books" page of the main R site, and the introduction to the lme4
package. But be aware that this is a bit more intricate than simple
(generalized) linear models...
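To fix ideas, such a model could look like this (a sketch only; dm,
marker, temp and family are made-up names for the hypothetical marker
study):

library(lme4)
set.seed(1)   # made-up data : 40 subjects in 10 families
dm <- data.frame(family = factor(rep(1:10, each = 4)),
                 temp   = rnorm(40, 37, 0.7))
dm$marker <- 1 + 0.3 * (dm$temp - 37) +       # fixed effects
             rnorm(10, 0, 0.5)[dm$family] +   # "between families"
             rnorm(40, 0, 0.3)                # "within families"
m1 <- lmer(marker ~ temp + (1 | family), data = dm)    # random intercept
m2 <- lmer(marker ~ temp + (temp | family), data = dm) # random slope too
## (for a binary outcome, one would use glmer(..., family = binomial))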

In your case, your "geology" factor may or may not be such a label. You
have to look at the matter at hand to make a choice.

Hope this helps,

					Emmanuel Charpentier


