[R] Poisson Regression: questions about tests of assumptions

Mon Oct 15 08:04:53 CEST 2012

On Sun, 14 Oct 2012, Eiko Fried wrote:

> Thank you for the detailed answer, that was really helpful. I did some 
> excessive reading and calculating in the last hours since your reply, 
> and have a few (hopefully much more informed) follow up questions.
> 
> 1) In the vignette("countreg", package = "pscl"), LLH, AIC and BIC 
> values are listed for the models Negative-binomial (NB), Zero-Inflated 
> (ZI), ZI NB, Hurdle NB, and Poisson (Standard). And although I found a 
> way to determine LLH and DF for all models, BIC & AIC values are not 
> displayed by default, neither using the code given in the vignette. How 
> do I calculate these? (AIC is given as per default only in 2 models, BIC 
> in none).

The code underlying the vignette can always be inspected:
edit(vignette("countreg", package = "pscl"))
The JSS also hosts a cleaned-up version of the replication code.

Most likelihood-based models provide a logLik() method which extracts the 
fitted log-likelihood along with the corresponding degrees of freedom. 
Then the default AIC() method can compute the AIC and AIC(..., k = log(n)) 
computes the corresponding BIC. This is hinted at in Table 3 of the 
vignette. If "n", the number of observations, can be extracted by nobs(), 
then also BIC() works. This is not yet the case for zeroinfl/hurdle, 
though.

> 2) For the zero-inflated models, the first block of count model 
> coefficients is only in the output in order to compare the changes, 
> correct? That is, in my results in a paper I would only report the 
> second block of (zero-inflation) model coefficients? Or do I 
> misunderstand something?

No, no, yes. Hurdle and zero-inflation models have two linear predictors, 
one for the zero hurdle/inflation and one for the truncated/un-inflated 
count component. Both are typically of interest for different aspects of 
the data.

> I am
> confused because in their large table to compare coefficients, they
> primarily compare the first block of coefficients (p. 18)
> R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
> + "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb) 
> R> sapply(fm, function(x) coef(x)[1:8])

All models considered have a predictor for the mean of the count 
component, hence this can be compared across all models.

> 
> 3) 
>       There are various formal tests for this, e.g., dispersiontest()
>       in package "AER".
> 
> 
> I have to run 9 models - I am testing the influence of several 
> predictors on different individual symptoms of a mental disorder, as 
> "counted" in the last week (0=never in the last week, to 3 = on all day 
> within the last week). So I'm regressing the same predictors onto 9 
> different symptoms in 9 models. 

That's not really a count response. I guess an ordered response model (see 
e.g. polr() in MASS or package ordinal) would probably be more 
appropriate.

> Dispersiontest() for these 9 models result in 3-4 overdispersed models
> (depending if testing one- or two-sided on p=.05 level), 2 underdispersed
> models, and 4 non-significant models. The by far largest dispersion value is
> 2.1 in a model is not overdispersed according to the test, but that's the
> symptom with 80% zeros, maybe that plays a role here. 
> 
> Does BN also make sense in underdispersed models?

The variance function of the NB2 model is mu + 1/theta * mu^2. Hence for 
finite theta, the variance is always larger than the mean.

>       However, overdispersion can already matter before this is
>       detected by a significance test. Hence, if in doubt, I would
>       simply use an NB model and you're on the safe side. And if the
>       NB's estimated theta parameter turns out to be extremely large
>       (say beyond 20 or 30), then you can still switch back to Poisson
>       if you want.
> 
> Out of the 9 models, the 3 overdispersed NB models "worked" with theta
> estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
> appropriate here. 
> 
> 4 other models (including the 2 underdispersed ones) gave warnings (theta
> iteration / alternation limit reached), and although the other values look
> ok (estimates, LLH, AIC), theta estimates are between 1.000 and 11.000. 
> Could you recommend me what to do here? 

As I said before: A theta around 20 or 30 is already so large that it is 
virtually identical to a Poisson fit (theta = Inf). These values clearly 
indicate that theta is not finite.

However, this almost certainly stems from your response which is not 
really count data. As I said above: An ordered response model will 
probably work better here. If you have mainly variation between 0 vs. 
larger but not much among the levels 1/2/3, a binary response (0 vs. 
larger) may be best.

hth,
Z

> Thanks
> T
> 
>