[R] logistic regression model specification

Wed Nov 14 19:03:19 CET 2007

On Nov 13, 2007 4:58 PM, Emmanuel Charpentier
<charpent at bacbuc.dyndns.org> wrote:
> Pardon me for intruding, but I had recently to explain something
> analogous to this to a distinguished physician who had hurt himself
> playing with survival models and was bleeding graphs and analyses all
> over the place...

Thanks for taking the time to reply. I would rather be embarrassed now
than later!

> Dylan Beaudette a écrit :
>
> > On Tuesday 13 November 2007, Prof Brian Ripley wrote:
> >> On Tue, 13 Nov 2007, Dylan Beaudette wrote:
> >>> Hi,
> >>>
> >>> I have setup a simple logistic regression model with the glm() function,
> >>> with the follow formula:
> >>>
> >>> y ~ a + b
> >>>
> >>> where:
> >>> 'a' is a continuous variable stratified by
> >>> the levels of 'b'
> >>>
> >>>
> >>> Looking over the manual for model specification, it seems that
> >>> coefficients for unordered factors are given 'against' the first level of
> >>> that factor.
> >
> > Thanks for the quick reply.
> >
> >> Only for the default coding.
> >
> > Indeed, I should have added that to my initial message.
> >
> >>> This makes for difficult interpretation when using factor 'b' as a
> >>> stratifying model term.
> >> Really?  You realize that you have not 'stratified' on 'b', which would
> >> need the model to be a*b?  What you have is a model with parallel linear
> >> predictors for different levels of 'b', and if the coefficients are not
> >> telling you what you want you should change the coding.
> >
> > I should have specified that interpretation was difficult, not because of the
> > default behaviour, rather my limitations and the nature of the data. Perhaps
> > an example would help.
> >
> > y ~ a + b
> >
> > 'a' is a continuous predictor (i.e. temperature)
> > observed on the levels of 'b' (geology)
>
> Hmmm... are they fixed (as in "reproductible" or "having well known
> properties which could be expected to remain stable if the
> observation/experiment is repeated") or random (i. e. label for whatever
> variation is thought to be attribuable to local conditions). In other
> words, are you interested in the effect of a, b being a nuisance (source
> of variability), or in the effects of b by themselves ?

This gets at the real heart of the hypothesis: the logit of my
response (a soil taxonomic feature) is directly related to a energetic
parameter (temperature, solar radiance, etc.), but the relationship is
not necessarily the same across geologic type. Since we do not have an
exhaustive set of data which would encompass all of the parameters
that geology would influence, we are using it as a stratifying
variable. However, interpretation of the coefficients associated with
the levels of geology might be give some other insight.

> What glm does supposes that you are considering b as a fixed effects.
> Treating it as a random effect would involve something like Douglas
> Bates' lme4.
>
> Your reference to geology leaves the interpretation open... more on this
> later.

ok -- i do not really want to use geology as a 'treatment' per se ...
however I see where this may be leading.

>
> > The form of the model (or at least what I was hoping for) would account for
> > the variation in 'y' as predicted by 'a', within each level of 'b' . Am I
> > specifying this model incorrectly?
> >
> >> Much of what I am trying to get across is that you have a lot of choice as
> >> to how you specify a model to R. There has to be a default, which is
> >> chosen because it is often a good choice.  It does rely on factors being
> >> coded well: the 'base level' (to quote ?contr.treatment) needs to be
> >> interpretable.  And also bear in mind that the default choices of
> >> statistical software in this area almost all differ (and R's differs from
> >> S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good
> >> choice' do differ.
> >
> > Understood. I was not implying a level of 'goodness', rather hoping to gain
> > some insight into a (possibly) mis-coded model specification.
> >
> >>> Setting up the model, minus the intercept term, gives me what appear to
> >>> be more meaningful coefficients. However, I am not sure if I am
> >>> interpreting the results from a linear model without an intercept term.
> >>> Model predictions from both specifications (with and without an intercept
> >>> term) are nearly identical (different by about 1E-16 in probability
> >>> space).
> >>>
> >>> Are there any gotchas to look out for when removing the intercept term
> >>> from such a model?
> >> It is just a different parametrization of the linear predictor.
> >> Anything interpretable in terms of the predictions of the model will be
> >> unchanged.  That is the crux: the default coefficients of 'b' will be
> >> log odds-ratios that are directly interpretable, and those in the
> >> per-group coding will be log-odds for a zero value of 'a'. Does a zero
> >> value of 'a' make sense?
> >
> > In the case of this experiment, a zero-level for 'a' does not make sense.
>
> Then, maybe re-scaling a in a' in a way giving some useful physical
> meaning to "a=0" might give you insights.
>
> For example, let's say you are interested in the variations of a biven
> biological marker with body temperature. wether you express the latter
> in °C, °K or °F, your statistical inferences about the phenomenon will
> have the same validity. However, expressing them in (°C-37) give the
> coefficients a very simple, interesting and useful meaning : they
> express the behaviour of your marker at physiologically normal body
> temperature. This reference is indeed useful.
>
> Bill Venables expressed this idea (and other very important ones) in a
> nice conference paper entitled (approximately) "Exegeses on the linear
> model", whose reference I've lost (again !). In this paper, he suggests
> that "centering" around the median is often useful (i.e a'<-a-median(a)).

I will try and find this reference...

> > Further thoughts welcomed.
>
> Well... a bit on the parametrisation : if you use R's default of
> "contrast treatment", your coefficient will express differences between
> the coefficient named and the reference one (the one you don't find a
> coefficient in the "summary(glm...)". This is easy to interpret.

Ok. Between posting to the list, and reading some more in MASS and the
R manual I think that I can understand this part...

> If you use, for example, "contrast.helmert" (which was (still is ?) the
> default in S-plus), your coefficients are something totally different :
> they are choosen in a way ensuring that the sum of the coefficients are
> 0 for each row of recoded data. This choice is popular with some,
> because they allow for direct calculation of so-called "Type III sum of
> squares" so refered by the FDA and despised by Bill Venables and Brian
> Ripley. Other choices are possible ad might be useful in different contexts.

I tried this, however I am not really interested in the Type III SS value.

> Now for "stratification".
>
> You wrote your model as "y~a+b". This can be translated as "the (log
> odds-ratio of the) effect fluctuates around a central value which is the
> sum of a grand mean (the intercept), a factor specific to the particular
> geology and a quantity proportional to temperature, this last
> proportionality factor being the same for *all* geologies".

same slope, different intercepts (?)

> In other words, you get tools allowing you to predict y in various
> circumstances :
> an intercept (a "grand mean"), representing the *expectation* for y when
> a=0 and b is neglected ;
> a set of corrections for each b (minus the reference, as explained
> before), allowing you to have a "better" expectation for a=0 ;
> a coefficient of correction allowing you to refine your expectation when
> the temperature a is now known.
>

ok -- this sounds about right for the original model design... however
I am starting to wonder if another form is worth trying.

> If you wanted to express the idea that this factor might vary between
> geologies, you would write "y~a*b", which (as MASS will explain you in
> detail) is equivalent to write "y~a+b+a:b", which can be translated (in
> poor English) as "the (log odds-ratio of the) effect fluctuates around a
> central value which is the sum of a grand mean (the intercept), a factor
> specific to the particular geology, a quantity proportional to
> temperature (the same for *all* geologies) AND a correction to this
> proportionality factor specific to each geology times temperature".
>
> This will give you :
> a grand mean, as before, for a=0 and b neglected ;
> a set of differences between each b and the reference b (for a=0, of
> course) ;
> a "gross slope" of the line passing "at best" through the cloud
> representing your observation in a (a,y) graph,, allowing you to refine
> your prediction when a is known, but ignoring b
> and a correction factor for this slope for each b.

Hmmm. All of the marginal t-tests for this model come out not significant...

> This is maybe not the easiest way to express this relationship. You
> might be more interested in each particular slope and not by a "grand
> slope". This can be expressed by "y~b+(a %in% b)", which is equivalent
> to "y~a+a:b", which gives you directly a grand mean, a correction for
> the group and each slope.

This is an interesting option. I will try this out.

> You could also have no interest in the "grand mean" but only in each
> particular mean : "y~a*b-1" where the "-1" term specifies to suppress
> the intercept (the "grand mean"), which will therefore be added to each
> difference in b, thus giving directly means for a=0.
>
> You migh even be interested in getting just a mean and a slope for each
> group  "y~a+a:b-1", which combine the two modifications made before.

ok. how about this one:

# suggested in MASS
y ~ b/a - 1

this would give me a regression for each level of geology (?). Would
this allow me to ask the question: is the response relative to 'a'
different for each level of 'b' ?

> But note that if yoou need to test for an effect of a or an effect of b,
> the models where you suppress the "grand mean" or "grand slope" will
> give you difficulties : the hypotheses you will be testing will be
> different of what is usually tested. This is well explained in MASS 4
> (with more details in the "Linear Statistical Models" chapter).

Thanks. I will take a look at that chapter.

> Please also note that I abused the meaning of "prediction" : to be able
> to predict an expectation is not sufficient, and predict.lm() and such
> are here for a good reason...

Right.

> *******
>
> A word about "random effects" :
>
> I supposed until now that you are indeed interested in the effect of b
> ("geology") on y because b is some reproductible feature of your terrain
> (e. g. a particular level of b might be "Hercynian fold", and you expect
> to extract an information that will remain valid when going to another
> site with the same characteristics (e. g. learning something in the Alps
> and using this knowledge in the Pyrenneans).

That would be ideal, and somewhat contingent on a more robust sampling
scheme. However, in this case I would just like to recognize that the
influence of 'a' on logit(y) is different across levlels of 'b' --
geology.

> Now it could be something entirely different. To make myself clear, I
> will return to biology.
>
> Suppose that I am interested by the relation between body temperature
> and a biological marker. I might be interested to know if this
> relationship varies if my atien is male or female : this is a
> characteristic I can easily observe if I do again the experiment with
> another subject (or I should change professions pronto ...), and I can
> expect that the "sex effect" eventually observed on a random sample of a
> given population will be re-observed when extracting another sample of
> the same population.
>
> Now, I might also observe that the relation between my marker and body
> temperature may vary with "familial characteristics", i. e. :
> 1) this relation varies from people to people, but
> 2) this variation is less important between people coming from the same
> family than between unrelated people.
>
> The point is that this characteristic (the "family") cannot be
> *observed* (or *measured*) directly, and is but a *label*. Therefore, if
> I observe this effect in a set of families choosen at random from a
> given population, I cannot expect to use this knowledge when studying
> another set of families extracted from the same population.
>
> In other words, I just known that the variability ("distance", in other
> words) between unrelated people is larger than the variability between
> related people.
>
> It is rather tempting to decompose this variability in two parts :
> "between families" and "within families", thus "explaining away" a part
> of this variability and getting better estimates of the relationship of
> interest (i.e. the temperature-marker relationship). This is similar,
> but in no ways identical, to the separation between the effects of
> temperature and effects of sex one should do (ANCOVA) in the previous
> example : the "family" effect is random, you cannot observe it directly,
> and you have to estimate the variability it adds to the main efect.
>
> That is what "mixed models" aim to do. The way to separate fixed effects
> (in pour case the temperature-larker relationship) from random
> effects("between families" variation) from residual variability
> (exemplified by the "within families" variability) is somewhat intricate.
>
> I recommend reading Pinhero and Bates (2000), referenced in the "Books"
> page of the main R site, and the introduction to the lme4 package. But
> be aware that this is a bit more intricate than simple (generalized)
> linear models...
>
> In your case, your "geology" factor may or may not be such a label. You
> have to look at the matter at hand to make a choice.
>
> Hope this helps,
>

Yes. This was very helpful, thank you for taking the time to type it
all into an email!

Cheers,

Dylan

>                                         Emmanuel Charpentier
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>