[R] Unexpected behavior with weights in binomial glm()

Mon Oct 1 00:15:08 CEST 2012

Bert Gunter <gunter.berton <at> gene.com> writes:

> 
> I haven't followed this thread closely, but if perfect separation in a
> binomial glm is the problem, google it. e.g.
> 
> http://www.ats.ucla.edu/stat/mult_pkg/faq/general/complete_separation_logit_models.htm
> 
> This presumably explains your concerns about coefficient agreement.
> 

 Agreed.  The rest of my answer is below.

Josh Browning <rockclimber112358 <at> gmail.com> writes:

> Yes, I agree that the results are "very similar" but I don't
> understand why they are not exactly equal given that the data sets are
> identical.
> 
> And yes, this 1% numerical difference is hugely important to me.  I
> have another data set (much larger than this toy example) that works
> on the aggregated data (returning a coefficient of about 1) but
> returns the warning about perfect separation on the non-aggregated
> data (and a coefficient of about 1e15).  So, I'd at least like to be
> able to understand where this numerical difference is coming from and,
> preferably, a way to tweak my glm() runs (possibly adjusting the
> numerical precision somehow???) so that this doesn't happen.
> 
> Josh

  I played around with this a bit, and I think the problem is so
numerically unstable that you really can't just tweak the settings on
glm() to make it work.  (When a problem is numerically unstable,
nearly trivial differences like the order of operations or even the
compiler used can make big differences in the results.)
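
For reference, here is a minimal sketch of the two equivalent ways of
fitting the same binomial model. The data are a made-up toy example (not
Josh's data), and the names agg, dis, fit_agg and fit_dis are purely
illustrative. On a well-behaved problem like this one the two fits
should agree to many digits, though not necessarily bit for bit; it is
under (near-)separation that small differences can blow up.

```r
# Hypothetical toy data: 'succ' successes out of 'n' trials at each x.
agg <- data.frame(x = c(1, 2, 3, 4), succ = c(1, 2, 3, 4), n = c(5, 5, 5, 5))

# Disaggregate into one 0/1 row per trial.
dis <- data.frame(
  x = rep(agg$x, agg$n),
  y = unlist(Map(function(s, n) rep(c(1, 0), c(s, n - s)), agg$succ, agg$n))
)

# Aggregated fit: proportion as response, number of trials as weights.
fit_agg <- glm(succ / n ~ x, family = binomial, weights = n, data = agg)

# Disaggregated fit: one Bernoulli observation per row.
fit_dis <- glm(y ~ x, family = binomial, data = dis)

# The likelihoods differ only by a constant, so the estimates should be
# close; the size of the gap reflects floating-point and convergence details.
print(cbind(aggregated = coef(fit_agg), disaggregated = coef(fit_dis)))
print(max(abs(coef(fit_agg) - coef(fit_dis))))
```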

There's a very nice blog post about the numerics of GLM here:

http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/

One of the conclusions is:

  And most practitioners are unfamiliar with this situation
  [numerical instability of GLMs in some cases] because:

  * They rightly do not concern themselves with the implementation
    details, as these are best left to the software implementors.

  * They are very likely to encounter issues arising from separation,
    which will mask other issues.

You appear to have a (near- or complete-) separation problem.  I would
strongly recommend the logistf package, which fits logistic regression
by Firth's penalized likelihood (when I tried it, I got near-identical
results from the aggregated and disaggregated data).
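
As a rough sketch of what the logistf suggestion looks like in practice
(assuming the logistf package is installed; the data frame sep and the
object names are made up for illustration and are not Josh's data):

```r
# Assumes the logistf package is installed: install.packages("logistf")
library(logistf)

# Hypothetical, completely separated data: y is 0 for x <= 2, 1 for x >= 3.
sep <- data.frame(x = c(1, 1, 2, 2, 3, 3, 4, 4),
                  y = c(0, 0, 0, 0, 1, 1, 1, 1))

# Ordinary maximum likelihood: glm() is expected to warn here, and the
# slope is only limited by when the iterations stop.
ml_fit <- suppressWarnings(glm(y ~ x, family = binomial, data = sep))

# Firth's penalized likelihood: the estimates stay finite under separation.
firth_fit <- logistf(y ~ x, data = sep)

print(coef(ml_fit))
print(coef(firth_fit))
```

The same comparison could be repeated on aggregated and disaggregated
versions of a data set to see whether the two penalized fits agree.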

 I would also argue that if a 1% difference in the estimate of a
parameter whose confidence interval is essentially undefined (try
confint(), with the MASS package loaded, on your results) is concerning
you, then you have some bigger problems to wrestle with ...
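
To illustrate the point about the confidence interval, here is a small
sketch on made-up, completely separated data (again, not Josh's data;
sep, fit, slope and ci are illustrative names). The exact numbers will
depend on the R version and platform, so none are quoted here.

```r
# Hypothetical, completely separated data: y is 0 for x <= 2, 1 for x >= 3.
sep <- data.frame(x = c(1, 1, 2, 2, 3, 3, 4, 4),
                  y = c(0, 0, 0, 0, 1, 1, 1, 1))

# glm() is expected to warn about fitted probabilities of 0 or 1.
fit <- suppressWarnings(glm(y ~ x, family = binomial, data = sep))

# Slope estimate and its Wald standard error; under separation the
# standard error dwarfs the estimate.
slope <- summary(fit)$coefficients["x", c("Estimate", "Std. Error")]
print(slope)

# Profile-likelihood interval. In R before 4.4.0 the glm method comes
# from MASS (a recommended package); later versions ship it in stats.
# Profiling may warn or fail outright under separation, so guard the call.
ci <- try(suppressWarnings(confint(fit)), silent = TRUE)
print(ci)
```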

  good luck
    Ben Bolker