[R] glm.nb - theta, dispersion, and errors

Mon Oct 22 08:38:15 CEST 2012

On Sun, 21 Oct 2012, Eiko Fried wrote:

> I am running 9 negative binomial regressions with count data.
>
> The nine models use 9 different dependent variables - items of a clinical
> screening instrument - and use the same set of 5 predictors. Goal is to
> find out whether these predictors have differential effects on the items.
>
> Due to various reasons, one being that I want to avoid overfitting models,
> I need to employ identical types of models for all 9 regressions.
>
> The problem is that some of my dependent variables are overdispersed,
> others are not (parameter estimates gained by using family=quasipoisson).
>
> dispersion p         theta
> S1 1.084 0.102 30.0
> S2 0.903 0.000 4125.0       PROBLEM
> S3 0.997 0.926 3754.0       PROBLEM
> S4 0.784 0.000 10283.0     PROBLEM
> S5 1.108 0.003 8.6
> S6 1.010 0.772 1287.0       PROBLEM
> S7 1.228 0.001 1.7
> S8 1.222 0.005 0.4
> S9 2.120 0.283 1.2
>
> So I thought using maximum likelihood Poisson models would provide wrong
> results, and that using negative binomial models would be the best way to
> go.
>
> 3 models cause the following problem (the ones that are not overdispersed):
>> m2.nb <- glm.nb(t0s2 ~ Sex + HisDep + FamHis + ZEFE + ZNeuro, data=data)
>
>> Error in while ((it <- it + 1) < limit && abs(del) > eps) { :
>> missing value where TRUE/FALSE needed
>
> What happens is that, looking at theta iterations, values become too large,
> and thereby infinite, and thereby NaN.
> This is an example for model 2 above, with a theta of 4125 and a
> significantly smaller dispersion value than 1 of .903 (p<.000):
>
>> theta.ml: iter76 theta =1.34118e+16
>> theta.ml: iter77 theta =3.0232e+16
>> theta.ml: iter78 theta =-Inf
>> theta.ml: iter79 theta =NaN
>
> One symptom causes this problem:
> 1: In sqrt(1/i) : NaNs produced
>
> How would you recommend to deal with these problems? Are results reliable
> although these errors occur?
>
> Again, I have to use a model that fits all 9 models overall best - maybe I
> should default to maximum likelihood Poisson?

>From the description of your response in earlier e-mails on this list, my 
feeling is that this is caused by the type of response you have. It is not 
really a count response and might be better modeled by an ordered response 
model. Below I've simulated data from an ordered probit model and then 
pretended they are count data. The result are the symptoms you describe: 
underdispersion, non-convergence for theta in NB. Fitting an ordered 
probit model, avoids these problems for this data and (not surprisingly) 
reasonably recovers the true parameters.

hth,
Z

## random regressor and latent Gaussian variable
set.seed(1)
x <- runif(200)
y0 <- 5 * x + rnorm(200)

## collapse to factor with 5 levels
yf <- cut(y0, c(-Inf, 3, 4, 5, 6, Inf), labels = 0:4)

## pretend levels are count variables
yc <- as.numeric(as.character(yf))

## quasi-Poisson with underdispersion
summary(glm(yc ~ x, family = quasipoisson))

## NB with problems in theta estimation
summary(glm.nb(yc ~ x))

## ordered probit recovers true parameters
summary(polr(yf ~ x, method = "probit", Hess = TRUE))

## compare with the latent linear model
summary(lm(y0 ~ x))

> Thank you
> Torvon
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>