[R] questions about co-linearity in logistic regression from Stefano Sofia

Greg Snow 538280 at gmail.com
Tue Aug 2 17:41:47 CEST 2016


Stefano,

It is usually best to keep these discussions on R-help.  People there
may be quicker to respond and could have better answers. Keeping the
discussion on the list also means that if others in the future find
your question, they will also find the answers and discussion.  And
some of us can spend some time answering questions on the list as a
community stewardship contribution, but when asked directly it turns
into consulting and we would need to start charging a fee (and I
expect that sharing your answer with the community as a whole is a lot
cheaper than what my employer would insist that I charge you as a
consultant).

Quick answers to your questions:

1. An example of the data:

> tmp.dat <- data.frame(color=factor(c('red','green','blue'),
+ levels=c('red','green','blue'))
+ )
> model.matrix(~color-1, data=tmp.dat)
  colorred colorgreen colorblue
1        1          0         0
2        0          1         0
3        0          0         1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$color
[1] "contr.treatment"


2.  There is co-linearity with the intercept because the intercept is
represented by a column of 1's, and if you add together the 3 dummy
columns above you get exactly that same column of 1's.
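
For example, reusing the tmp.dat object from above (a quick check I am
adding here for illustration):

> X <- model.matrix(~color-1, data=tmp.dat)
> rowSums(X)   # the 3 dummy columns add up to the intercept's column of 1's
1 2 3 
1 1 1 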

3. There are 3 pieces of information (3 colors), so you need 3 columns
to represent them.  You can reconstruct the "blue" column by subtracting
"red" and "green" from the column of 1's that represents the intercept,
so only one of the 4 columns is redundant and only one needs to be
dropped.  Dropping the last 2 columns would leave only 2 pieces of
information (just "red" and the intercept), which would not capture
everything: "blue" and "green" would be lumped together in that case.
Any one of the columns could be dropped (including the intercept); R/S
just chose to drop the last one by default.
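
For example, using the dummy-coded matrix from answer 1 (a small check I
am adding here):

> X <- model.matrix(~color-1, data=tmp.dat)
> 1 - X[, "colorred"] - X[, "colorgreen"]   # reconstructs the "blue" column
1 2 3 
0 0 1 
> all(1 - X[, "colorred"] - X[, "colorgreen"] == X[, "colorblue"])
[1] TRUE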

4.  The full answer to this would involve studying the reparameterized
cell-means model:
Bryce, G. Rex, Del T. Scott, and M. W. Carter. "Estimation and
hypothesis testing in linear models: a reparameterization approach to
the cell means model." Communications in Statistics Part A - Theory and
Methods 9.2 (1980): 131-150.

but the quick answer can be seen from the parameterization that uses the
intercept and drops the column for "blue" (here the factor levels are
not specified, so they default to alphabetical order and "blue" becomes
the baseline level):

> tmp.dat <- data.frame(color=factor(c('red','green','blue')))
> model.matrix(~color, data=tmp.dat)
  (Intercept) colorgreen colorred
1           1          0        1
2           1          1        0
3           1          0        0
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$color
[1] "contr.treatment"

> solve(.Last.value)
            1 2  3
(Intercept) 0 0  1
colorgreen  0 1 -1
colorred    1 0 -1

You can see that the row corresponding to "blue" has only the intercept
as non-zero, while the other rows have the intercept plus one other
column equal to 1.  In the contrast matrix (the result of `solve`) the
intercept row picks out exactly observation 3 (the blue one), and the
other rows are the differences between the other 2 colors and blue.
Even with the dummy variables you can see that when predicting red you
take the intercept plus the coefficient for red, when predicting green
you take the intercept plus the coefficient for green, and when
predicting blue you just use the intercept.  This should intuitively
suggest that the intercept is the mean/prediction for blue and the other
coefficients are the differences to be added to "blue" to get the other
values.
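
To make this concrete, here is a small simulated example that I am
adding for illustration (the y values are made up, and I use lm() rather
than a logistic model so that the coefficients are plain means and
differences of means):

> set.seed(42)
> dat <- data.frame(color = factor(rep(c('red','green','blue'), each = 5)),
+                   y = rnorm(15, mean = rep(c(10, 20, 30), each = 5)))
> mns <- tapply(dat$y, dat$color, mean)   # group means, in level order blue, green, red
> cf  <- coef(lm(y ~ color, data = dat))  # (Intercept), colorgreen, colorred
> all.equal(unname(cf),
+           unname(c(mns["blue"], mns["green"] - mns["blue"], mns["red"] - mns["blue"])))
[1] TRUE

The intercept matches the blue group mean, and the other two
coefficients are the green and red means minus the blue mean.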

5. see answer to 4




On Mon, Aug 1, 2016 at 4:18 AM, Stefano Sofia
<stefano.sofia at regione.marche.it> wrote:
> Dear Dr. Snow,
> my name is Stefano Sofia (I am a meteorologist); I always read the
> posts on the R mailing list.
>
> A few days ago I read with great interest your answer about "Why the order of
> parameters in a logistic regression affects results significantly?"
>
> I am very interested in regression; I have some background in it, but
> not solid enough to understand the considerations you mentioned.
> Therefore, may I ask you some questions (only if you have time), or for
> a reference text where I can find the answers?
>
> Sorry for the disturbance, and thank you for your help.
> Stefano Sofia PhD
>
> Here my questions.
>
> 1. Could you please give me a very short example of three predictors (red,
> green and blue) that are indicator variables with a 1 in exactly one of
> those variables?
> 2. Why is there co-linearity with the intercept in this case?
> 3. Why, in case of co-linearity, is only the last variable (blue) removed
> and not the last two?
> 4. Why is the intercept the average for blue?
> 5. And finally, why are the coefficients the differences between red/green
> and blue on average?
>
> Here there is your original e-mail:
> "...
> The fact that the last coefficient is NA in both outputs suggests that
> there was some co-linearity in your predictor variables and R chose to
> drop one of the offending variables from the model (the last one in
> each case).  Depending on the nature of the co-linearity, the
> interpretation (and therefore the estimates) can change.
>
> For example, let's say that you have 3 predictors, red, green, and blue
> that are indicator variables (0/1) and that every subject has a 1 in
> exactly one of those variables (so they are co-linear with the
> intercept).  If you put the 3 variables into a model with the
> intercept in the above order, then R will drop the blue variable and
> the interpretation of the coefficients is that the intercept is the
> average for blue subjects and the other coefficients are the
> differences between red/green and blue on average.  If you refit the
> model with the order blue, green, red, then R will drop red from the
> model and now the interpretation is that the intercept is the mean for
> red subjects and the others are the differences from red on average, a
> very different interpretation and therefore different estimates.
>
> I expect something along those lines is going on here."
>
>



-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com


