[R] CORRECTION: Re: Multicollinearity with brglm?

Ioannis Kosmidis I.Kosmidis at warwick.ac.uk
Thu Apr 2 16:56:00 CEST 2009


Thanks for your mail.  I guess that the constant row sum in X would create 
problems in a simulation framework because you might end up with linearly 
dependent columns, or even with columns of zeros (which, I believe, do not make 
much sense).

First of all, I think there is a problem with your example below.  For this X, 
two columns should be eliminated if a constant is to be included in the model, 
yet in summary(mod.simple.brglm) only one appears to be eliminated.
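
As a quick check, reconstructing x from the four rows shown in your output 
below (object names are just for illustration):

x <- data.frame(X1 = factor(c(0, 0, 1, 1)),
                X2 = factor(c(1, 1, 0, 0)),
                X3 = factor(c(0, 1, 0, 1)),
                X4 = factor(c(1, 0, 1, 0)))
X <- model.matrix(~ X1 + X2 + X3 + X4, data = x)
ncol(X)       # 5 columns: the intercept plus one dummy per factor
qr(X)$rank    # 3, so two of the five columns are redundant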

The reason for eliminating columns is merely to report a parameterization that 
is identifiable. 

For example, consider a single binomial variable with 
observed value 2 and total number of trials 10.  Also, let's suppose that we 
are interested in the log-odds of success, beta1.  The 
estimated log-odds for this sample is

hat{beta1} = log(2/8) = -1.386

so that the fitted probability is 0.2.

If another constant, say beta2, is introduced in the model, then there are 
infinitely many values that the vector (beta1, beta2) can take that give 
fitted probability 0.2 (for example (-1, -0.386) or (-10^8, 10^8 - 1.386)), 
and no choice is better than another.  So glm chooses to eliminate one of the 
two constants in order to get an identifiable parameterization, in which to 
each value of beta1 there corresponds one and only one value of the fitted 
probability.
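
If it helps, this can be reproduced directly in R; a minimal sketch, with 
object names chosen only for illustration:

s <- 2; f <- 8
extra <- 1                                      # a second constant covariate
coef(glm(cbind(s, f) ~ 1, family = binomial))   # (Intercept) = log(2/8) = -1.386
coef(glm(cbind(s, f) ~ 1 + extra, family = binomial))  # 'extra' is reported as NA (aliased)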

I hope this helps.

Best wishes,

Ioannis




On Thursday 02 April 2009 12:43:37 woodbomb wrote:
> Ioannis,
>
> Here's an illustrative example. Note that: glm also objects to X4; X1,..,X4
> are defined as factors.
>
> I've looked (albeit in a crude way) at various examples using the perturb
> package and it seems to confirm that X4 is the source of multicollinearity.
> As I say, I think the constant row-sum condition is the source of the
> problem, but I'm not sure why or how to deal with it.
>
> Thanks for your interest (and for the finite parameter estimates brglm
> provides)!
>
> >attributes(x)
>
> $names
> [1] "X1" "X2" "X3" "X4"
>
> $row.names
> [1] "2" "3" "4" "5"
>
> $class
> [1] "data.frame"
>
> >x
>
>   X1 X2 X3 X4
> 2  0  1  0  1
> 3  0  1  1  0
> 4  1  0  0  1
> 5  1  0  1  0
>
> >attributes(y)
>
> $dim
> [1] 4 2
>
> $dimnames
> $dimnames[[1]]
> NULL
>
>
> $dimnames[[2]]
> [1] "s" "f"
>
> >y
>
>      s f
> [1,] 3 7
> [2,] 2 8
> [3,] 5 5
> [4,] 3 7
>
> >summary(mod.simple)
>
> Call:
> brglm(formula = cbind(s, f) ~ X1 + X2 + X3 + X4, family = binomial,
>     data = data)
>
>
> Coefficients: (1 not defined because of singularities)
>
> (Dispersion parameter for binomial family taken to be 1)
>
>     Null deviance: 4.5797  on 5  degrees of freedom
> Residual deviance: 3.6469  on 2  degrees of freedom
> Penalized deviance: -1.79616
> AIC:  26.793
>
> >summary(mod.simple.brglm)
>
> Call:
> glm(formula = cbind(s, f) ~ X1 + X2 + X3 + X4, family = binomial,
>     data = data)
>
> Deviance Residuals:
>       1        2        3        4        5        6
>  0.7103  -1.0256   0.3445   0.3760  -1.1876   0.6072
>
> Coefficients: (1 not defined because of singularities)
>               Estimate Std. Error  z value Pr(>|z|)
> (Intercept) -1.356e+00  9.219e-01   -1.471    0.141
> X11          2.445e-01  7.003e-01    0.349    0.727
> X21          7.264e-01  7.048e-01    1.031    0.303
> X31          6.316e-14  6.959e-01 9.08e-14    1.000
> X41                 NA         NA       NA       NA
>
> (Dispersion parameter for binomial family taken to be 1)
>
>     Null deviance: 5.0363  on 5  degrees of freedom
> Residual deviance: 3.5957  on 2  degrees of freedom
> AIC: 26.742
>
> Number of Fisher Scoring iterations: 4



