[Rd] nobs() with glm(family="poisson")

John Fox jfox at mcmaster.ca
Thu Feb 28 00:59:01 CET 2013


Dear Milan and Steven,

At the risk of muddying the water further, I think that the potential confusion here is that Poisson GLMs are applied in two formally equivalent but substantively different situations: (1) where the counts are cells in a contingency table, in which case the Poisson GLM is used to fit an equivalent loglinear association model to the table; and (2) where the counts are observations on a non-negative integer response -- what's often called "Poisson regression." In the first case, but not the second, it makes sense to think of the sum of the counts as the natural sample size. I don't think that one can expect GLM software to distinguish these cases.

Best,
 John

> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-
> project.org] On Behalf Of Milan Bouchet-Valat
> Sent: Wednesday, February 27, 2013 5:59 PM
> To: Steven McKinney
> Cc: r-devel
> Subject: Re: [Rd] nobs() with glm(family="poisson")
> 
> Le mercredi 27 février 2013 à 14:26 -0800, Steven McKinney a écrit :
> >
> > > -----Original Message-----
> > > From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-
> project.org]
> > > On Behalf Of Milan Bouchet-Valat
> > > Sent: February-27-13 12:56 PM
> > > To: peter dalgaard
> > > Cc: r-devel
> > > Subject: Re: [Rd] nobs() with glm(family="poisson")
> > >
> > > Thanks for the (critical, indeed) answer!
> > >
> > > Le mercredi 27 février 2013 à 20:48 +0100, peter dalgaard a écrit :
> > > > On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote:
> > > >
> > > > > I cannot believes nobody cares about this -- or I'm completely
> wrong
> > > and
> > > > > in that case everybody should rush to put the shame on me... :-
> p
> > > >
> > > > Well, nobs() is the number of observations. If you have 5 Poisson
> > > > distributed counts, you have 5 observations.
> > > Well, say that to the statistical offices that spend millions to
> survey
> > > thousands of people with correct (but complex) sampling designs,
> they'll
> > > be happy to know that the collected data only provides an
> information
> > > equivalent to 5 independent outcomes. ;-)
> >
> > Milan:
> >
> > It seems to me you are mixing up Binomial and Poisson situations,
> > and not assessing independence appropriately.
> >
> > The above example discusses Bernoulli outcomes which are sometimes
> > aggregated into Binomial "cases" depending on the study design.
> > Now if the survey samples people in the same household or even
> > neighbourhood, those Bernoulli outcomes will not be independent
> > (hence clustered survey techniques) and summing the Binomial
> > denominators would not be appropriate, for the survey analysis or
> > for BIC calculations.  The "n" in the BIC calculation should
> > reflect independent observations.  If you knock on the same
> > door 1000 times and ask the person who they will vote for,
> > you do not have 1000 independent observations, even though
> > your Binomial denominator is 1000.
> My intention was not to introduce the issue of survey designs into the
> discussion, but merely to make the point that in surveys, counts are
> usually *to some extent at least* independent observations, even when
> clustering is present, and that the fact that different people are
> asked
> and that each answer costs money is the best indication of that.
> Anyway,
> BIC does not apply if we are not assuming that the data comes from a
> simple random sample, so let's leave this complication aside.
> 
> > The example you show from ?glm is a Poisson example showing
> > 9 independent Poisson counts.  If I count the number of cars
> > passing through an intersection during non-overlapping
> > one minute intervals (say 9 such intervals), then the number
> > of observations I have is the number of non-overlapping
> > one minute interval car count totals (e.g. the nine counts
> > c(18, 17, 15, 20, 10, 20, 25, 13, 12)), not the number of
> > cars I saw in total.
> Interesting. Indeed in the observation setting you describe, 9 is
> AFAICT
> the correct number of observations. Is this kind of data commonly
> fitted
> using glm()?
> 
> Do you happen to possess a copy of the book where the ?glm example
> comes
> from? There are not many of them here in France so I cannot consult it
> easily. It seems to me that in the context of a randomized controlled
> trial, the number of independent observations is the number of
> subjects,
> not the number of groups. And thus, BIC() would still return a wrong
> value for the ?glm example.
> 
> > A piece of software that adds things up can not know the
> > context from which the numbers were derived, so you have to
> > figure out the level of independence appropriate to your
> > study design and work out the BIC count accordingly.
> This is a strong argument indeed. It would mean that BIC() is at best a
> function of very limited use, or even a dangerous one, unless one can
> safely assume that the case were the number of observations equals the
> number of rows in the data is by far the most common one. I am biased
> due to my use of log-linear models, but I doubt this is the case. Is it
> (I might perfectly be wrong)?
> 
> > Raftery alludes to this in a preceding section:
> >
> > "When the data have been collected using a complex survey
> > design with resulting weights, it is not yet clear what n
> > should be, and this issue awaits further study.  However,
> > it seems reasonable that if the model is based on an
> > assumption of simple random sampling but the sampling
> > design is less efficient, then n should be reduced to
> > reflect the efficiency of the sampling design relative to
> > simple random sampling."
> I think Raftery had in mind surveys in which the assumption of
> independence between observations (counts, not rows) does not hold, but
> where it is still the reference from which the sample deviates (lower
> "efficiency"). In this case, the number of cells/rows by no means a
> good
> measure of the number of observations either -- but as I said BIC is
> usually considered as not defined in this case.
> 
> 
> Thanks for sharing your remarks.
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list