[R] comparing the fit of two (gamma) distributions for aggregated data

Wed Oct 6 14:08:58 CEST 2010

Hi,

I have aggregated data in 5 categories and want to fit a gamma
distribution to it (works fine).
My question is that on theoretical grounds I could claim that the
observation in the first category (the zero count) are certain, i.e. I
know for sure the number. This would then mean in fitting a gamma
distribution to the remaining 4 categories.

My question is now how can I compare the fit between the two to choose
what fits better?

I think the AIC/BIC (or all likelihood based test) don't apply as I
compare two different data sets (one with data for all 5 categories
and one with the data of the  4 categories, hence the number of
observations differs substantially) and similarly I think chi2 square
test is not really revealing as in  the number of observations is so
large (the Null is rejected in both cases). Or would one of the tests
still apply, claiming one is somehow nested in the other?

Another I idea might be to fit to the 5 categories but restrict the
estimator somehow so that it replicates the number of observations in
the first category (all basically zero). I don't know if that is
advisable (and if so, feasible). Any ideas or comments welcome.

Below is some code with example data (but I have much more data to fit
than just those).

library(stats4)

breaks.5 <- c(0,.25,20,40,60,120)                            #breaks
counts.5 <- c(4422, 40225,	10973,	3145	,3228)   #number of obs,
breaks.4 <- c(.25,20,40,60,120)                            #like above
but just without the first categories
counts.4 <- c( 40225,	10973,	3145	,3228)   #number of obs,

ll <- function(shape, rate) #my likelihood function
{   z <- pgamma(breaks, shape=shape, rate=rate)
     -sum(counts * log(diff(z)))
}

#fitting the 5 categories
breaks <-breaks.5
counts <- counts.5
results.mle.5<-mle(ll, start=list(shape=1, rate=1/mean(breaks)))

#fitting the 4 categories
breaks <-breaks.4
counts <- counts.4
results.mle.4<-mle(ll, start=list(shape=1, rate=1/mean(breaks)))

Thanks,
Stefan