[R] Unexpected behavior with weights in binomial glm()

Joshua Wiley jwiley.psych at gmail.com
Mon Oct 1 01:27:37 CEST 2012


Hi Josh,

I would not trust either result in this case.  Additional numerical
precision is not likely to help (well, it might bring the two sets of
results into agreement, but agreement does not help if neither result
is meaningful).  If this were my data, I would try an approach like
this:

# load required package
require(MCMCglmm)
# the categorical family is binomial for this case
# note the prior on the fixed effects (B)
# usually, a large variance is used (like 1e10)
# however, in this case I shrink it to get better mixing
# this has the effect of shrinking the posterior mode towards 0
# residual variance is fixed in logistic models (here at 1)
# because of separation, high autocorrelation for INDEP
# I use a high thinning interval and a large number of iterations
# to get around this
m <- MCMCglmm(RESP ~ INDEP, data = d, family = "categorical",
  prior = list(
    B = list(mu = c(0, 0), V = diag(2) * 1e3),
    R = list(V = 1, fix = 1)),
  nitt = 1e7, thin = 10000, burnin = 50000, verbose=FALSE)
# check for autocorrelation
plot(m$Sol)
autocorr(m$Sol)
# summary
summary(m)

# scaling constant
k <- ((16*sqrt(3))/(15*pi))^2
# divide the posterior samples by sqrt(1 + k * residual variance)
# (the residual variance was fixed at 1 above, so this is sqrt(1 + k))
# this puts the estimates on roughly the same scale as those from a
# standard ML logistic model
posterior.mode(m$Sol/(sqrt(1 + k)))
## which for me gave the estimates
## (Intercept)       INDEP
##  -1.602042   12.697246

# highest posterior density interval (like 95% CI)
HPDinterval(m$Sol/(sqrt(1 + k)))
##                 lower      upper
## (Intercept) -4.289269  0.7573618
## INDEP        1.670226 53.9993992
## attr(,"Probability")
## [1] 0.9497487


# example showing how close the match is to ML without separation
m2 <- MCMCglmm(vs ~ mpg, data = mtcars, family = "categorical",
  prior = list(
    B = list(mu = c(0, 0), V = diag(2) * 1e10),
    R = list(V = 1, fix = 1)),
  nitt = 1e6, thin = 500, burnin = 10000)
plot(m2$Sol)
summary(m2)
k <- ((16*sqrt(3))/(15*pi))^2
posterior.mode(m2$Sol/(sqrt(1 + k)))
summary(glm(vs ~ mpg, data = mtcars, family = binomial))

I do not think that the MCMCglmm function takes weights, but you can
manually expand your data so that you do not need the weights (a
sketch of this is below).  You will likely need to play around with
the variance on the fixed effects (the V in the B prior) and the
thinning interval (higher leads to less autocorrelation but requires
more iterations) to get stable results.  I would probably target an
effective sample size (eff.samp in the summary output of an MCMCglmm
object) of at least 1,000.  If you run 10,000 iterations with a
thinning interval of 100, you only store 10000/100 = 100 samples, and
if there is still any autocorrelation, the effective sample size will
be even lower.  Hence, with a thinning interval of 10,000 and hoping
to store about 1,000 samples, I set the number of iterations to
10000 * 1000 = 1e7.  In reality you get *slightly* fewer because of
the 50,000 burnin.  Note that this may be rather slow (a few minutes
for your toy example on my system, and it will only take longer if
the real data set is huge).
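
For example, here is a minimal sketch of the expansion, assuming the
aggregated data frame and column names (RESP, INDEP, WT) from your
original post, plus a direct check of the effective sample size with
coda (m$Sol is stored as an mcmc object):

# repeat each aggregated row WT times so no weights are needed
dExp <- dAgg[rep(seq_len(nrow(dAgg)), dAgg$WT), c("RESP", "INDEP")]
rownames(dExp) <- NULL
nrow(dExp)  # should equal sum(dAgg$WT)

# effective sample size of the fixed effects, via coda
require(coda)
effectiveSize(m$Sol)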

Unless you have very unusual scaling (e.g., a predictor that only
ranges from .0001 to .0002 and so has tiny variance), a coefficient
of about 1e15 seems highly suspect to me and is most likely nonsense.
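
To illustrate (simulated data, just a sketch, not your data): a
predictor with tiny variance can legitimately produce a huge
coefficient, and rescaling the predictor brings the estimate back to
an interpretable magnitude:

set.seed(1)
x <- runif(100, 1e-4, 2e-4)  # predictor with tiny variance
y <- rbinom(100, 1, plogis(-1 + 20000 * (x - 1.5e-4)))
# a coefficient on the order of 1e4 is real here, driven by the scaling
coef(glm(y ~ x, family = binomial))
# standardizing the predictor gives a coefficient of ordinary size
coef(glm(y ~ scale(x), family = binomial))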

I described another option (that relies on similar techniques) on this
page: http://www.ats.ucla.edu/stat/R/dae/exlogit.htm
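
For what it is worth, one more option you could look into is Firth's
penalized-likelihood logistic regression (a different approach from
the Bayesian one above; just a suggestion), e.g., via the logistf
package:

require(logistf)
# Firth's bias-reduced estimates stay finite even under complete
# separation
summary(logistf(RESP ~ INDEP, data = d))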

Good luck,

Josh



On Sun, Sep 30, 2012 at 4:47 AM, Josh Browning
<rockclimber112358 at gmail.com> wrote:
> Hi David,
>
> Yes, I agree that the results are "very similar" but I don't
> understand why they are not exactly equal given that the data sets are
> identical.
>
> And yes, this 1% numerical difference is hugely important to me.  I
> have another data set (much larger than this toy example) that works
> on the aggregated data (returning a coefficient of about 1) but
> returns the warning about perfect separation on the non-aggregated
> data (and a coefficient of about 1e15).  So, I'd at least like to be
> able to understand where this numerical difference is coming from and,
> preferably, a way to tweak my glm() runs (possibly adjusting the
> numerical precision somehow???) so that this doesn't happen.
>
> Josh
>
> On Sat, Sep 29, 2012 at 7:50 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Sep 29, 2012, at 7:10 AM, Josh Browning wrote:
>>
>>> Hi useRs,
>>>
>>> I'm experiencing something quite weird with glm() and weights, and
>>> maybe someone can explain what I'm doing wrong.  I have a dataset
>>> where each row represents a single case, and I run
>>> glm(...,family="binomial") and get my coefficients.  However, some of
>>> my cases have the exact same values for predictor variables, so I
>>> should be able to aggregate up my data frame and run glm(...,
>>> family="binomial",weights=wts) and get the same coefficients (maybe
>>> this is my incorrect assumption, but I can't see why it would be).
>>> Anyways, here's a minimum working example below:
>>>
>>>> d = data.frame( RESP=c(rep(1,5),rep(0,5)), INDEP=c(1,1,1,1,0,0,0,0,0,0) )
>>>> glm( RESP ~ INDEP, family="binomial", data=d )
>>>
>>> Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = d)
>>>
>>> Coefficients:
>>> (Intercept)        INDEP
>>>     -1.609       21.176
>>>
>>> Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
>>> Null Deviance:      13.86
>>> Residual Deviance: 5.407        AIC: 9.407
>>>> dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
>>>> colnames(dAgg) = c("RESP","INDEP","WT")
>>>> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT )
>>>
>>> Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = dAgg,
>>>    weights = WT)
>>>
>>> Coefficients:
>>> (Intercept)        INDEP
>>>     -1.609       20.975
>>>
>>> Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
>>> Null Deviance:      13.86
>>> Residual Deviance: 5.407        AIC: 9.407
>>
>> Those two results look very similar and it is with a data situation that seems somewhat extreme. The concern is for the 1% numerical  difference in the regression coefficient? Am I reading you correctly?
>>
>> --
>> David Winsemius, MD
>> Alameda, CA, USA
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/



