[R] Unexpected behavior with weights in binomial glm()

Tue Oct 2 18:18:42 CEST 2012

Thank you all for your help and advice.  This wasn't quite the answer
I was looking for, but these concepts make more sense to me now and I
think I should be able to resolve the issues I've been having.

Thanks again!

On Sun, Sep 30, 2012 at 6:26 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Sep 30, 2012, at 4:47 AM, Josh Browning wrote:
>
>> Hi David,
>>
>> Yes, I agree that the results are "very similar" but I don't
>> understand why they are not exactly equal given that the data sets are
>> identical.
>>
>> And yes, this 1% numerical difference is hugely important to me.
>
> I believe you will find it is not important at all.
>
>> I have another data set (much larger than this toy example) that works
>> on the aggregated data (returning a coefficient of about 1) but
>> returns the warning about perfect separation on the non-aggregated
>> data (and a coefficient of about 1e15).  So, I'd at least like to be
>> able to understand where this numerical difference is coming from and,
>> preferably, a way to tweak my glm() runs (possibly adjusting the
>> numerical precision somehow???) so that this doesn't happen.
>
> Here. This shows how you can ratchet up your desired precision:
>
> dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
> colnames(dAgg) = c("RESP","INDEP","WT")
>
> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-10,
>     maxit = 50) )
> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-11,
>     maxit = 30) )
> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-12,
>     maxit = 50) )
> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = glm.control(epsilon = 1e-13,
>     maxit = 50) )
>
>  ... and finally get a message that _should_ demonstrate you why I thought asking for greater numerical precision in an extreme case such as this was a "Fool's errand". The "true" value is positive infinity, but numerical precision is only 8 bytes so we could only get to an estimated odds ratio of exp(32.975 ) = 2.09344e+14, which corresponds to odds of 200 trillion to 1. Further increases on the 'maxit' and decreases in 'epsilon' are vigorously ignored, .... and rightfully so.
>
> After you do such analyses long enough you learn that coefficients above 5 are suspicious and those above 10 are usually a sign of an error in data preparation.
>
>  (I do not think this is an example of complete separation, since the range of the predictors overlap for the two outcomes. But who know I have made (many) errors before.)
>
> --
> David,
>
>
>>
>> Josh
>>
>> On Sat, Sep 29, 2012 at 7:50 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>> On Sep 29, 2012, at 7:10 AM, Josh Browning wrote:
>>>
>>>> Hi useRs,
>>>>
>>>> I'm experiencing something quite weird with glm() and weights, and
>>>> maybe someone can explain what I'm doing wrong.  I have a dataset
>>>> where each row represents a single case, and I run
>>>> glm(...,family="binomial") and get my coefficients.  However, some of
>>>> my cases have the exact same values for predictor variables, so I
>>>> should be able to aggregate up my data frame and run glm(...,
>>>> family="binomial",weights=wts) and get the same coefficients (maybe
>>>> this is my incorrect assumption, but I can't see why it would be).
>>>> Anyways, here's a minimum working example below:
>>>>
>>>>> d = data.frame( RESP=c(rep(1,5),rep(0,5)), INDEP=c(1,1,1,1,0,0,0,0,0,0) )
>>>>> glm( RESP ~ INDEP, family="binomial", data=d )
>>>>
>>>> Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = d)
>>>>
>>>> Coefficients:
>>>> (Intercept)        INDEP
>>>>    -1.609       21.176
>>>>
>>>> Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
>>>> Null Deviance:      13.86
>>>> Residual Deviance: 5.407        AIC: 9.407
>>>>> dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
>>>>> colnames(dAgg) = c("RESP","INDEP","WT")
>>>>> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT )
>>>>
>>>> Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = dAgg,
>>>>   weights = WT)
>>>>
>>>> Coefficients:
>>>> (Intercept)        INDEP
>>>>    -1.609       20.975
>>>>
>>>> Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
>>>> Null Deviance:      13.86
>>>> Residual Deviance: 5.407        AIC: 9.407
>>>
>>> Those two results look very similar and it is with a data situation that seems somewhat extreme. The concern is for the 1% numerical  difference in the regression coefficient? Am I reading you correctly?
>>>
>>> --
>>> David Winsemius, MD
>>> Alameda, CA, USA
>>>
>
> David Winsemius, MD
> Alameda, CA, USA
>