[R] Unexpected behavior with weights in binomial glm()
bbolker at gmail.com
Mon Oct 1 00:15:08 CEST 2012
Bert Gunter <gunter.berton <at> gene.com> writes:
> I haven't followed this thread closely, but if perfect separation in a
> binomial glm is the problem, google it. e.g.
> This presumably explains your concerns about coefficient agreement.
Agreed. The rest of my answer is below.
Josh Browning <rockclimber112358 <at> gmail.com> writes:
> Yes, I agree that the results are "very similar" but I don't
> understand why they are not exactly equal given that the data sets are
> And yes, this 1% numerical difference is hugely important to me. I
> have another data set (much larger than this toy example) that works
> on the aggregated data (returning a coefficient of about 1) but
> returns the warning about perfect separation on the non-aggregated
> data (and a coefficient of about 1e15). So, I'd at least like to be
> able to understand where this numerical difference is coming from and,
> preferably, a way to tweak my glm() runs (possibly adjusting the
> numerical precision somehow???) so that this doesn't happen.
I played around with this a bit, and I think the problem is so
numerically unstable that you really can't just tweak the settings on
glm() to make it work. (When a problem is numerically unstable,
nearly trivial differences like the order of operations or even the
compiler used can make big differences in the results.)
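
As a minimal illustration of the order-of-operations point (a sketch that is not from the original thread), floating-point addition is not associative, so summing the same numbers in a different order can change the last bits of the result. In a well-conditioned problem that difference is harmless; in an unstable one it can be amplified.

```r
# Floating-point addition is not associative: the same three numbers,
# grouped differently, give results that differ in the last bit.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)

print(a == b)                    # exact comparison
print(isTRUE(all.equal(a, b)))   # comparison within a numerical tolerance
print(a - b)                     # size of the discrepancy
```

The exact comparison fails while all.equal() succeeds, which is why tolerance-based comparisons are the usual choice for numerical results.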
There's a very nice blog post about the numerics of GLM here:
One of the conclusions is:

  And most practitioners are unfamiliar with this situation
  [numerical instability of GLMs in some cases] because:
  * They rightly do not concern themselves with the implementation
    details, as these are best left to the software implementors.
  * They are very likely to encounter issues that arise from
    separation, which will mask other issues.

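You appear to have a (near- or complete-) separation problem. I would
strongly recommend the logistf package, which fits Firth's penalized-
likelihood logistic regression and so gives finite estimates under
separation (when I tried it, I got near-identical results from the
aggregated and disaggregated data).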
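
A rough sketch of that comparison follows. It is not the poster's data: the toy data frames `long` and `agg` and the count column `n` are made up for illustration, and the logistf part only runs if the package is installed. It fits glm() to completely separated data, collecting the warnings, and then fits logistf() to the disaggregated and aggregated versions.

```r
# Hypothetical toy data with complete separation: y = 0 for x <= 2, else 1.
long <- data.frame(x = rep(1:4, each = 3),
                   y = rep(c(0, 0, 1, 1), each = 3))
# The same data aggregated to one row per x, with counts used as weights.
agg <- data.frame(x = 1:4, y = c(0, 0, 1, 1), n = 3)

# Ordinary glm(): collect the warnings instead of printing them.
glm_warnings <- character(0)
fit_glm <- withCallingHandlers(
  glm(y ~ x, family = binomial, data = long),
  warning = function(w) {
    glm_warnings <<- c(glm_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(glm_warnings)
print(coef(fit_glm))   # the slope is very large; its value is not meaningful

# Firth-type fit, only attempted if logistf is available.
if (requireNamespace("logistf", quietly = TRUE)) {
  fit_long <- logistf::logistf(y ~ x, data = long)
  fit_agg  <- logistf::logistf(y ~ x, data = agg, weights = n)
  print(rbind(disaggregated = coef(fit_long), aggregated = coef(fit_agg)))
}
```

The penalized fits should give finite coefficients for both data layouts; how closely they agree on a real data set is something to check rather than assume.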
I would also argue that if a 1% difference in the estimate of a
parameter whose confidence interval is essentially undefined (try
confint() on your results, with MASS loaded) is concerning you, then you have
some bigger problems to wrestle with ...
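
To make the "essentially undefined" interval concrete, here is a small sketch on made-up, completely separated data (the vectors `x` and `y` are hypothetical). It uses the Wald interval from the reported standard error rather than the profile-likelihood interval that confint() computes, simply because the Wald version always returns numbers that can be inspected.

```r
# Hypothetical toy data with complete separation between x = 3 and x = 4.
x <- 1:6
y <- c(0, 0, 0, 1, 1, 1)

# glm() warns about the separation here; the warnings are suppressed
# because only the reported estimate and standard error are of interest.
fit <- suppressWarnings(glm(y ~ x, family = binomial))

est <- summary(fit)$coefficients["x", c("Estimate", "Std. Error")]
print(est)

# Wald 95% interval: estimate +/- 1.96 * standard error.
wald <- est[["Estimate"]] + c(-1, 1) * qnorm(0.975) * est[["Std. Error"]]
print(wald)
```

The standard error dwarfs the estimate, so the interval spans a huge range on both sides of zero, and a 1% change in the point estimate carries no information.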