[Rd] Standardized Pearson residuals

Thu Mar 17 20:46:41 CET 2011

John Maindonald <john.maindonald at anu.edu.au> writes:

> One can easily test for the binary case and not give the statistic in
> that case.
>
> A general point is that if one gave no output that was not open to
> abuse, there'd be nothing given at all!

Thanks John.  I've been reluctant to push too hard for this on r-devel,
but this is more or less the point I made to Peter in a private email.
My example was the deviance, which is just as open to abuse as Pearson's
chi-square (except that differences of deviances for nested,
non-saturated models are usually ok), but which print.glm() and
summary.glm() proudly display at every opportunity.  If we were overly
concerned about preventing abuse, we might just provide the
(log-)likelihood so that users would at least have to take some
(trivial) action to compute the "dangerous" deviance statistic.

Obviously the question is one of balancing utility against the
likelihood of abuse, and apparently you and I would agree that the
utility of having Pearson's chi-square available as a matter of course
outweighs the dangers.  I don't think it would be too difficult to get
Peter to agree, but I suspect that there would likely be a strong
general reluctance to change glm() and/or summary.glm().  This could be
avoided by just providing a convenience function, as you suggest,
without necessarily changing any existing function.  This would put
Pearson's chi-square on a more nearly equal R-footing with the deviance
as a goodness-of-fit statistic, which I don't think anyone would argue
with, while still leaving the user with a bit to do to get a p-value,
say.

One of Peter's other posts suggests, perhaps unintentionally, a
functionality that would actually return the results of a
goodness-of-fit test, though possibly using only the deviance.  This
might indeed be useful, but I would still like to have an easy,
standardized way to just get Pearson's chi-square statistic.  Assuming a
situation in which either is appropriate, I tend to prefer Pearson's
chi-square over the deviance for testing goodness-of-fit, because I
think that in marginal cases its null distribution is usually better
approximated by the chi-square (if you think I'm wrong about this,
please let me know).  Pearson's chi-square also has the advantage of
being much easier to explain as a goodness-of-fit statistic than the
deviance, leaving everything else about the deviance to the realm of
likelihood-ratio tests, where it mostly belongs.  My original feature
request was, after all, mostly about having a simple, standardized way
in R for "naive users" (my students) to do some fairly standard things.

> One would not be giving any output at all from poisson or binomial
> models, given that data that really calls for quasi links (or a glmm
> with observation level random effects) is in my experience the rule
> rather than the exception!
>
> At the very least, why not a function dispersion() or pearsonchisquare()
> that gives this information.
>
> Apologies that I misattributed this.
>
> John Maindonald             email: john.maindonald at anu.edu.au
> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
> Centre for Mathematics & Its Applications, Room 1194,
> John Dedman Mathematical Sciences Building (Building 27)
> Australian National University, Canberra ACT 0200.
> http://www.maths.anu.edu.au/~johnm
>
> On 16/03/2011, at 12:41 AM, peter dalgaard wrote:
>
>> 
>> On Mar 15, 2011, at 13:42 , John Maindonald wrote:
>> 
>>>> Peter Dalgaard: It would also be nice for teaching purposes if glm or summary.glm had a
>>>> "pearsonchisq" component and a corresponding extractor function, but I
>>>> can imagine that there might be arguments against it that haven't
>>>> occured to me.  Plus, I doubt that anyone wants to touch glm unless it's
>>>> to repair a bug. If I'm wrong about all that though, ...
>>> 
>> 
>> Umm, that was Brett, actually.
>
>>> This would remedy what I have long judged a deficiency in summary.glm().
>>> The information is important for diagnostic purposes.  One should not have
>>> to fit a model with a quasi error, or suss out how to calculate the Pearson
>>> chi square from the glm model object, to discover that the information in the
>>> model object is inconsistent with simple binomial or poisson assumptions.
>> 
>> It could be somewhere between useless and misleading in cases like binary logistic regression though. (Same thing goes for the test against the saturated model: Sometimes it makes sense and sometimes not.)
>> 
>>> 
>>> John Maindonald             email: john.maindonald at anu.edu.au
>>> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
>>> Centre for Mathematics & Its Applications, Room 1194,
>>> John Dedman Mathematical Sciences Building (Building 27)
>>> Australian National University, Canberra ACT 0200.
>>> http://www.maths.anu.edu.au/~johnm
>>> 
>>> On 15/03/2011, at 10:00 PM, r-devel-request at r-project.org wrote:
>>> 
>>>> From: Brett Presnell <presnell at stat.ufl.edu>
>>>> Date: 15 March 2011 2:40:29 PM AEDT
>>>> To: peter dalgaard <pdalgd at gmail.com>
>>>> Cc: r-devel at r-project.org
>>>> Subject: Re: [Rd] Standardized Pearson residuals
>>>> 
>>>> 
>>>> 
>>>> Thanks Peter.  I have just a couple of minor comments, and another
>>>> possible feature request, although it's one that I don't think will be
>>>> implemented.
>>>> 
>>>> peter dalgaard <pdalgd at gmail.com> writes:
>>>> 
>>>>> On Mar 14, 2011, at 22:25 , Brett Presnell wrote:
>>>>> 
>>>>>> 
>>>>>> Is there any reason that rstandard.glm doesn't have a "pearson" option?
>>>>>> And if not, can it be added?
>>>>> 
>>>>> Probably... I have been wondering about that too. I'm even puzzled why
>>>>> it isn't the default. Deviance residuals don't have quite the
>>>>> properties that one might expect, e.g. in this situation, the absolute
>>>>> residuals sum pairwise to zero, so you'd expect that the standardized
>>>>> residuals be identical in absolute value
>>>>> 
>>>>>> y <- 1:4
>>>>>> r <- c(0,0,1,1)
>>>>>> c <- c(0,1,0,1)
>>>>>> rstandard(glm(y~r+c,poisson))
>>>>>       1          2          3          4 
>>>>> -0.2901432  0.2767287  0.2784603 -0.2839995 
>>>>> 
>>>>> in comparison,
>>>>> 
>>>>>> i <- influence(glm(y~r+c,poisson))
>>>>>> i$pear.res/sqrt(1-i$hat)
>>>>>       1          2          3          4 
>>>>> -0.2817181  0.2817181  0.2817181 -0.2817181 
>>>>> 
>>>>> The only thing is that I'm always wary of tampering with this stuff,
>>>>> for fear of finding out the hard way why thing are the way they
>>>>> are....
>>>> 
>>>> I'm sure that's wise, but it would be nice to get it in as an option,
>>>> even if it's not the default
>>>> 
>>>>>> Background: I'm currently teaching an undergrad/grad-service course from
>>>>>> Agresti's "Introduction to Categorical Data Analysis (2nd edn)" and
>>>>>> deviance residuals are not used in the text.  For now I'll just provide
>>>>>> the students with a simple function to use, but I prefer to use R's
>>>>>> native capabilities whenever possible.
>>>>> 
>>>>> Incidentally, chisq.test will have a stdres component in 2.13.0 for
>>>>> much the same reason.
>>>> 
>>>> Thank you.  That's one more thing I won't have to provide code for
>>>> anymore.  Coincidentally, Agresti mentioned this to me a week or two ago
>>>> as something that he felt was missing, so that's at least two people who
>>>> will be happy to see this added.
>>>> 
>>>> It would also be nice for teaching purposes if glm or summary.glm had a
>>>> "pearsonchisq" component and a corresponding extractor function, but I
>>>> can imagine that there might be arguments against it that haven't
>>>> occured to me.  Plus, I doubt that anyone wants to touch glm unless it's
>>>> to repair a bug. If I'm wrong about all that though, ...
>>>> 
>>>> BTW, as I go along I'm trying to collect a lot of the datasets from the
>>>> examples and exercises in the text into an R package ("icda").  It's far
>>>> from complete and what is there needed tidying up, but I hope to
>>>> eventually to round it into shape and put it on CRAN, assuming that
>>>> Agresti approves and that there are no copyright issues.
>>>> 
>>>>>> I think something along the following lines should do it:
>>>>>> 
>>>>>> rstandard.glm <-
>>>>>> function(model,
>>>>>>        infl=influence(model, do.coef=FALSE),
>>>>>>        type=c("deviance", "pearson"), ...)
>>>>>> {
>>>>>> type <- match.arg(type)
>>>>>> res <- switch(type, pearson = infl$pear.res, infl$dev.res)
>>>>>> res <- res/sqrt(1-infl$hat)
>>>>>> res[is.infinite(res)] <- NaN
>>>>>> res
>>>>>> }
>>>> 
>>> 
>>> 
>>> 	[[alternative HTML version deleted]]
>>> 
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> -- 
>> Peter Dalgaard
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com