[R] Unexpected behavior with weights in binomial glm()

Mon Oct 1 02:26:47 CEST 2012

On Sep 30, 2012, at 4:47 AM, Josh Browning wrote:

> Hi David,
> 
> Yes, I agree that the results are "very similar" but I don't
> understand why they are not exactly equal given that the data sets are
> identical.
> 
> And yes, this 1% numerical difference is hugely important to me.

I believe you will find it is not important at all.

> I have another data set (much larger than this toy example) that works
> on the aggregated data (returning a coefficient of about 1) but
> returns the warning about perfect separation on the non-aggregated
> data (and a coefficient of about 1e15).  So, I'd at least like to be
> able to understand where this numerical difference is coming from and,
> preferably, a way to tweak my glm() runs (possibly adjusting the
> numerical precision somehow???) so that this doesn't happen.

Here. This shows how you can ratchet up your desired precision:

dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
colnames(dAgg) = c("RESP","INDEP","WT")

glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-10, 
    maxit = 50) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-11, 
    maxit = 30) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-12, 
    maxit = 50) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-13, 
    maxit = 50) )

 ... and finally get a message that _should_ demonstrate you why I thought asking for greater numerical precision in an extreme case such as this was a "Fool's errand". The "true" value is positive infinity, but numerical precision is only 8 bytes so we could only get to an estimated odds ratio of exp(32.975 ) = 2.09344e+14, which corresponds to odds of 200 trillion to 1. Further increases on the 'maxit' and decreases in 'epsilon' are vigorously ignored, .... and rightfully so.

After you do such analyses long enough you learn that coefficients above 5 are suspicious and those above 10 are usually a sign of an error in data preparation.

 (I do not think this is an example of complete separation, since the range of the predictors overlap for the two outcomes. But who know I have made (many) errors before.)

-- 
David,

> 
> Josh
> 
> On Sat, Sep 29, 2012 at 7:50 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>> 
>> On Sep 29, 2012, at 7:10 AM, Josh Browning wrote:
>> 
>>> Hi useRs,
>>> 
>>> I'm experiencing something quite weird with glm() and weights, and
>>> maybe someone can explain what I'm doing wrong.  I have a dataset
>>> where each row represents a single case, and I run
>>> glm(...,family="binomial") and get my coefficients.  However, some of
>>> my cases have the exact same values for predictor variables, so I
>>> should be able to aggregate up my data frame and run glm(...,
>>> family="binomial",weights=wts) and get the same coefficients (maybe
>>> this is my incorrect assumption, but I can't see why it would be).
>>> Anyways, here's a minimum working example below:
>>> 
>>>> d = data.frame( RESP=c(rep(1,5),rep(0,5)), INDEP=c(1,1,1,1,0,0,0,0,0,0) )
>>>> glm( RESP ~ INDEP, family="binomial", data=d )
>>> 
>>> Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = d)
>>> 
>>> Coefficients:
>>> (Intercept)        INDEP
>>>    -1.609       21.176
>>> 
>>> Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
>>> Null Deviance:      13.86
>>> Residual Deviance: 5.407        AIC: 9.407
>>>> dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
>>>> colnames(dAgg) = c("RESP","INDEP","WT")
>>>> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT )
>>> 
>>> Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = dAgg,
>>>   weights = WT)
>>> 
>>> Coefficients:
>>> (Intercept)        INDEP
>>>    -1.609       20.975
>>> 
>>> Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
>>> Null Deviance:      13.86
>>> Residual Deviance: 5.407        AIC: 9.407
>> 
>> Those two results look very similar and it is with a data situation that seems somewhat extreme. The concern is for the 1% numerical  difference in the regression coefficient? Am I reading you correctly?
>> 
>> --
>> David Winsemius, MD
>> Alameda, CA, USA
>> 

David Winsemius, MD
Alameda, CA, USA