[R] Separation issue in binary response models - glm, brglm, logistf

Thu Feb 28 18:56:20 CET 2013

Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:

> Le 28/02/2013 17:22, Ben Bolker a écrit :
> > Xochitl CORMON<Xochitl.Cormon<at> ifremer.fr> writes:

> >> I am encountering some issues with my data and need some help. I am
> >> trying to run glm analysis with a presence/absence variable as
> >> response variable and several explanatory variable (time,
> >> location, presence/absence data, abundance data).

[snip]

> >> * logistf : analysis does not run When running logistf() I get a
> >> error message saying : # error in chol.default(x) : # leading minor
> >> 39 is not positive definite I looked into logistf package manual,
> >> on Internet, in the theoretical and technical paper of Heinze and
> >> Ploner and cannot find where this function is used and if the error
> >> can be fixed by some settings.
> >
> > chol.default is a function for Cholesky decomposition, which is going
> > to be embedded fairly deeply in the code ...
> 
> If I understand good I should just not use this package as this error is 
> not easily fixable ?

  Yes.

> >> * brglm : analysis run However I get a warning message saying : #
> >> In fit.proc(x = X, y = Y, weights = weights, start = start,
> >> etastart # = etastart, : # Iteration limit reached Like before i
> >> cannot find where and why this function is used while running the
> >> package and if it can be fixed by adjusting some settings.
> >>
> >> In a more general way, I was wondering what are the fundamental
> >> differences of these packages.
> >
> > You might also take a crack with bayesglm() in the arm package, which
> > should (?) be able to overcome the separation problem by specifying a
> > not-completely-uninformative prior.
> 
> Thank you for the tip I will have a look into this package and its doc 
> tomorrow. Do you have any idea of what is this fit.proc function ?

  It is again deep inside brglm.  You can use debug() to try to 
follow the process, but it will probably not help much.  **However**
you should definitely see ?brglm and especially ?brglm.control ...
adding

 control.brglm=brglm.control(br.maxit=1000)

to your function call might help (the default is 100)

> Here an extract of my table and the different formula I run :
> >>
> >>> head (CPUE_table)
> >> Year Quarter Subarea Latitude Longitude Presence.S CPUE.S
> >> Presence.H CPUE.H Presence.NP CPUE.NP Presence.BW CPUE.BW
> >> Presence.C CPUE.C Presence.P CPUE.P Presence.W CPUE.W 1 2000 1 31F1
> >> 51.25 1.5 0 0 0 0 0 0 0 0 1 76.002 0 0 1 3358.667
> >
> > [snip]
> >
> >> logistf_binomPres<- logistf (Presence.S ~ (Presence.BW + Presence.W
> >> + Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW +
> >> CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter +
> >> Latitude + Longitude)^2, data = CPUE_table)
> >>
> >> Brglm_binomPres<- brglm (Presence.S ~ (Presence.BW + Presence.W +
> >> Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H
> >> + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
> >> Longitude)^2, family = binomial, data = CPUE_table)
> >
> > It's not much to go on, but:
> 
> > * are you overfitting your data? That is, do you have at least 20
> > times as many 1's or 0's (whichever is rarer) as the number of
> > parameters you are trying to estimated?
> 
> I have 16 explanatory variable and with interactions we go to 136 
> parameters.
> 
>  > length (which((CPUE_table)[,]== 0))
> [1] 33466
> 
>  > length (which((CPUE_table)[,]== 1))
> [1] 17552
> 
> I assume the over fitting is good, isn't it?

  No, overfitting is *bad*.  But you do have a large data set, so you
might get away with fitting a model that's this complex.  I would
certainly start by seeing if you can successfully fit your model with
main effects only (i.e. temporarily get rid of the ^2)

 I don't think that your statements above really count the
zeros and ones in the _response_ variable -- I think you need

table(CPUE_table$Presence.S)

> > * have you examined your data graphically and looked for any strong
> > outliers that might be throwing off the fit?
> 
> I did check my data graphically in a lot and different ways. However if 
> you have any particular suggestions, please let me know. Concerning 
> strong outliers, I do not really understand what you mean. I have 
> outliers here and there but how can I know that they are strong enough 
> to throw off the fit? Most of the time they are really high abundance 
> coming from the fact that I'm using survey data and probably related to 
> the fact that the boat fished over a fish school.
> 
> > * do you have some strongly correlated/multicollinear predictors?
> 
> It's survey data so they indeed are correlated in time and space. 
> However I checked the scatterplot matrix and I didn't notice any linear 
> relation between variable.
> 
> > * for what it's worth it looks like a variety of your variables
> > might be dummy variables, which you can often express more compactly
> > by using a factor variable and letting R construct the design matrix
> > (i.e. generating the dummy variables on the fly), although that
> > shouldn't change your results
> 
> I will check about dummy variable concept as to be honest I don't really 
> understand what it means...
> 

  What it means is that if you have a categorical factor with
several levels (e.g. A, B, C) the way to fit a linear model
is to construct 'dummy variables' or 'contrasts' so that instead
of

var = A B C C B B A 

you have three variables:

intercept  B   C
1          0   0
1          1   0
1          0   1
1          0   1
1          1   0
1          1   0
1          0   0