[R] Overdispersion with binomial distribution

Tue Feb 17 19:54:27 CET 2009

 Thanks for the clarification.
  I actually had MASS open to that page while
I was composing my reply but forgot to mention
it (trying to do too many things at once) ...

  Ben Bolker

Prof Brian Ripley wrote:
> On Tue, 17 Feb 2009, Ben Bolker wrote:
> 
>> Jessica L Hite/hitejl/O/VCU <hitejl <at> vcu.edu> writes:
>>
>>> I am attempting to run a glm with a binomial model to analyze proportion
>>> data.
>>> I have been following Crawley's book closely and am wondering if there is
>>> an accepted standard for how much is too much overdispersion? (e.g. change
>>> in AIC has an accepted standard of 2).
> 
>>  In principle, in the null case (i.e. data are really binomial)
>> the deviance is  chi-squared distributed with the df equal
>> to the residual df.
> 
> *Approximately*, provided the expected counts are not near or below 
> one.  See MASS §7.5 for an analysis of the size of the approximation 
> errors (which can be large and in both directions).
> 
> Given that I once had a consulting job where the over-dispersion was 
> causing something close ot panic and was entirely illusory, the lack 
> of the 'approximately' can have quite serious consequences.
> 
>>  For example:
>>
>> example(glm)
>> deviance(glm.D93) ## 5.13
>> summary(glm.D93)$dispersion ## 1 (by definition)
>> dfr <- df.residual(glm.D93)
>> deviance(glm.D93)/dfr ## 1.28
>> d2 <- sum(residuals(glm.D93,"pearson")^2) ## 5.17
>> (disp2 <- d2/dfr)  ## 1.293
>>
>> gg2 <- update(glm.D93,family=quasipoisson)
>> summary(gg2)$dispersion  ## 1.293, same as above
>>
>> pchisq(d2,df=dfr,lower.tail=FALSE)
>>
>> all.equal(coef(glm.D93),coef(gg2)) ## TRUE
>>
>> se1 <- coef(summary(glm.D93))[,"Std. Error"]
>> se2 <- coef(summary(gg2))[,"Std. Error"]
>> se2/se1
>>
>> # (Intercept)    outcome2    outcome3  treatment2  treatment3
>> #   1.137234    1.137234    1.137234    1.137234    1.137234
>>
>> sqrt(disp2)
>> # [1] 1.137234
>>
>>> My code and output are below, given the example in the book, these data are
>>> WAY overdispersed .....do I mention this and go on or does this signal the
>>> need to try a different model? If so, any suggestions on the type of
>>> distribution (gamma or negative binomial ?)?
>>  Way overdispersed may indicate model lack of fit.  Have
>> you examined residuals/data for outliers etc.?
>>
>>  quasibinomial should be fine, or you can try beta-binomial
>> (see the aod package) ...
>>
>>
>>> attach(Clutch2)
>>>  y<-cbind(Total,Size-Total)
>>> glm1<-glm(y~Pred,"binomial")
>>> summary(glm1)
>>>
>>> Call:
>>> glm(formula = y ~ Pred, family = "binomial")
>>>
>>> Deviance Residuals:
>>>     Min       1Q   Median       3Q      Max
>>> -9.1022  -2.7899  -0.4781   2.6058   8.4852
>>>
>>> Coefficients:
>>>             Estimate Std. Error z value Pr(>|z|)
>>> (Intercept)  1.35095    0.06612  20.433  < 2e-16 ***
>>> PredF       -0.34811    0.11719  -2.970  0.00297 **
>>> PredSN      -3.29156    0.10691 -30.788  < 2e-16 ***
>>> PredW       -1.46451    0.12787 -11.453  < 2e-16 ***
>>> PredWF      -0.56412    0.13178  -4.281 1.86e-05 ***
>>> ---
>>> #### the output for residual deviance and df does not change even when I
>>> use quasibinomial, is this ok?  #####
>>  That's as expected.
>>
>>>  library(MASS)
>>  you don't really need MASS for quasibinomial.
>>
>>>> glm2<-glm(y~Pred,"quasibinomial")
>>>> summary(glm2)
>>> Call:
>>> glm(formula = y ~ Pred, family = "quasibinomial")
>>>
>>> Deviance Residuals:
>>>     Min       1Q   Median       3Q      Max
>>> -9.1022  -2.7899  -0.4781   2.6058   8.4852
>>>
>>> Coefficients:
>>>             Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)   1.3510     0.2398   5.633 1.52e-07 ***
>>> PredF        -0.3481     0.4251  -0.819  0.41471
>>> PredSN       -3.2916     0.3878  -8.488 1.56e-13 ***
>>> PredW        -1.4645     0.4638  -3.157  0.00208 **
>>> PredWF       -0.5641     0.4780  -1.180  0.24063
>>> ---
>>> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>
>>> (Dispersion parameter for quasibinomial family taken to be 13.15786)
>>>
>>>     Null deviance: 2815.5  on 108  degrees of freedom
>>> Residual deviance: 1323.5  on 104  degrees of freedom
>>>   (3 observations deleted due to missingness)
>>> AIC: NA
>>>
>>> Number of Fisher Scoring iterations: 5
>>>
>>>> anova(glm2,test="F")
>>> Analysis of Deviance Table
>>>
>>> Model: quasibinomial, link: logit
>>>
>>> Response: y
>>>
>>> Terms added sequentially (first to last)
>>>
>>>       Df Deviance Resid. Df Resid. Dev      F   Pr(>F)
>>> NULL                    108     2815.5
>>> Pred   4   1492.0       104     1323.5 28.349 6.28e-16 ***
>>> ---
>>> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>> model1<-update(glm2,~.-Pred)
>>>> anova(glm2,model1,test="F")
>>> Analysis of Deviance Table
>>>
>>> Model 1: y ~ Pred
>>> Model 2: y ~ 1
>>>   Resid. Df Resid. Dev  Df Deviance      F   Pr(>F)
>>> 1       104     1323.5
>>> 2       108     2815.5  -4  -1492.0 28.349 6.28e-16 ***
>>> ---
>>> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>> coef(glm2)
>>> (Intercept)       PredF      PredSN       PredW      PredWF
>>>   1.3509550  -0.3481096  -3.2915601  -1.4645097  -0.5641223
>>>
>>> Thanks
>>> Jessica
>>> hitejl <at> vcu.edu
> 
> 

-- 
Ben Bolker
Associate professor, Biology Dep't, Univ. of Florida
bolker at ufl.edu / www.zoology.ufl.edu/bolker
GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc