[Rd] nobs() with glm(family="poisson")

Thu Feb 28 00:46:46 CET 2013

On Feb 27, 2013, at 21:55 , Milan Bouchet-Valat wrote:

> Thanks for the (critical, indeed) answer!
> 
> Le mercredi 27 février 2013 à 20:48 +0100, peter dalgaard a écrit :
>> On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote:
>> 
>>> I cannot believe nobody cares about this -- or I'm completely wrong and
>>> in that case everybody should rush to put the shame on me... :-p
>> 
>> Well, nobs() is the number of observations. If you have 5 Poisson
>> distributed counts, you have 5 observations.
> Well, say that to the statistical offices that spend millions to survey
> thousands of people with correct (but complex) sampling designs, they'll
> be happy to know that the collected data only provides information
> equivalent to 5 independent outcomes. ;-)

My objection is mainly technical/conceptual: Suppose 5 Poisson counts, say of the number of defaults in 5 counties, are not 5 observations. Then how many observations are 5 negative binomial counts, say white blood cell counts in 5 patients? A generic function called nobs() should work similarly across a range of fitted models, and it would be inconsistent if it suddenly did something different for a single distribution.
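
A minimal sketch of the current behaviour, using five made-up counts (the data and the object names fit_pois and fit_gaus are purely illustrative): nobs() returns the number of response values regardless of the family, and the sum of the counts plays no role.

```r
## Five made-up counts, e.g. defaults in 5 counties (illustrative only)
y <- c(3, 0, 2, 5, 1)

## Same intercept-only model, two different families
fit_pois <- glm(y ~ 1, family = poisson)
fit_gaus <- glm(y ~ 1, family = gaussian)

nobs(fit_pois)  # 5: one observation per count
nobs(fit_gaus)  # 5: same answer for another family
sum(y)          # 11: the total count, which nobs() does not report
```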

> 
>> If the number of observations is not the right thing to use in some
>> context, use the right thing instead. Changing the definition of
>> nobs() surely leads to madness. 
> It is common usage in the literature using log-linear models to report
> the sum of counts as the number of observations. I think this indeed
> makes sense, but I'm not particularly attached to the choice of words --
> let's call it as you please.

It makes reasonable sense in isolation, I suppose, especially if you interpret the table as multinomial counts rather than Poisson ones. (If you additionally treat the total count as a Poisson variable, all cell counts become independent Poisson variables.) However, the issue here is about coherent and consistent software design, and that goes beyond dealing with contingency tables.
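
For reference, the standard argument linking the multinomial and Poisson views can be written out as follows. The notation is mine and not from the thread: N is the total count with mean lambda, p_i are the cell probabilities (summing to 1), and n is the sum of the y_i.

```latex
% Assumptions: N ~ Poisson(\lambda), (Y_1,\dots,Y_k) \mid N = n ~ Multinomial(n; p_1,\dots,p_k),
% with \sum_i p_i = 1 and n = \sum_i y_i.
\begin{align*}
P(Y_1 = y_1, \dots, Y_k = y_k)
  &= \frac{e^{-\lambda}\lambda^{n}}{n!}
     \cdot \frac{n!}{\prod_{i=1}^{k} y_i!} \prod_{i=1}^{k} p_i^{y_i} \\
  &= \prod_{i=1}^{k} \frac{e^{-\lambda p_i} (\lambda p_i)^{y_i}}{y_i!}
\end{align*}
% The joint probability factorizes, so the Y_i are independent Poisson(\lambda p_i).
```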

> 
> The root issue is that nobs() was precisely introduced to be the basis
> for the BIC() function, as ?nobs states explicitly:
>>     Extract the number of ‘observations’ from a model fit.  This is
>>     principally intended to be used in computing BIC (see ‘AIC’)
> 

I think it is unfortunate to specify a function in terms of what it is used for. It should be specified in terms of what it does.
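
To make concrete what is at stake for BIC(), here is a small sketch using the data from the first example in ?glm. The variables n_cells, n_total, bic_cells and bic_total are names I made up for illustration, and the "sum of counts" variant is computed by hand via the k argument of AIC(); it is not something any existing function returns.

```r
## Data from the first example in ?glm (Dobson 1990)
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)
glm.D93   <- glm(counts ~ outcome + treatment, family = poisson())

n_cells <- nobs(glm.D93)  # 9: the number of cells
n_total <- sum(counts)    # 150: the sum of the counts

## BIC() penalizes each parameter by log(nobs), i.e. log(9) here
bic_cells <- BIC(glm.D93)

## Hand-computed alternative with the penalty log(sum of counts)
bic_total <- AIC(glm.D93, k = log(n_total))

## The two differ by (number of parameters) * log(n_total / n_cells)
bic_total - bic_cells
```

Since the difference depends only on the number of parameters, which of the two values is "right" matters whenever models of different size are compared.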

> So it's OK to say that the number of observations is the number of cells
> (even if I think this is not very user-friendly), but then the
> documentation is misleading, and the BIC() function returns incorrect
> values for the very first example provided in ?glm.
> 
>> (I suppose that the fact that n is so obviously the wrong thing for
>> one particularly well-digested family of distribution functions could
>> be taken to indicate a generic weakness with the BIC.)
> I'm sure we can agree on the fact that BIC has its weaknesses (and I'm
> not the best person able to judge), but the point at stake is IMHO not
> one of them. After all, usual statistics for the Poisson family, such as
> deviance or residuals, are based on the sum of counts, not on the number
> of cells, and nobody objects.

At least for the deviance, that's just untrue. The deviance is zero for a saturated table. If some cells are split, the deviance becomes nonzero.
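
A small sketch of this, with made-up counts (all object names are illustrative): the same total of 30 is fitted with the same one-factor model, first as a saturated two-cell table and then with each cell split in two.

```r
## Saturated table: two cells, one parameter per cell
y_sat   <- c(10, 20)
g_sat   <- gl(2, 1)
fit_sat <- glm(y_sat ~ g_sat, family = poisson)

## Same totals per group, but each cell split into two sub-cells
y_split   <- c(4, 6, 12, 8)
g_split   <- gl(2, 2)
fit_split <- glm(y_split ~ g_split, family = poisson)

sum(y_sat) == sum(y_split)  # TRUE: the sum of counts is unchanged

deviance(fit_sat)    # zero up to numerical precision
deviance(fit_split)  # clearly nonzero
```

So the deviance depends on how the counts are distributed over the cells, not just on their sum.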

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com