[Rd] Incorrect fix for PR#9316: Cook's Distance & plot.lm

Heather.Turner at warwick.ac.uk Heather.Turner at warwick.ac.uk
Fri May 9 17:30:08 CEST 2008


This is a multi-part message in MIME format.
--------------030304040002000407020206
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Bug PR#9316 noted an inconsistency between the Cook's distance contours 
on plot.lm(x, which = 5) and the values given by cooks.distance(x) -- as 
shown in plot.lm(x, which = 4) -- for glms:

http://bugs.r-project.org/cgi-bin/R/Analyses-fixed?id=9316;user=guest;selectid=9316

The suggested fix was to modify the contour levels by a dispersion 
factor, implemented as follows:

dispersion <- if (isGlm)
   summary(x)$dispersion
else 1

cl.h <- sqrt(crit * p * (1 - hh)/hh * dispersion)    [1]

where crit is the value of Cook's distance along the contour, p is the 
number of parameters and hh is a vector of hat values. It is clear that 
  in the case of a normal linear model, the cl.h values will depend on 
whether the model was fitted by lm or glm. The following example 
illustrates this:

par(mfrow = c(2, 2))
plot(lm(y ~ ., data = freeny), 4:5)
plot(glm(y ~ ., data = freeny), 4:5)

For the lm fit we have

crit = (cl.h^2 * hh)/(p * (1 - hh))

where cl.h is on the scale of the standardized residuals. This is the 
Cook's distance as defined e.g. in the Cook & Weisberg reference on 
?cooks.distance.

For the glm fit we have

crit = (cl.h^2 * hh)/(p * (1 - hh) * summary(x)$dispersion)

which is incorrect.

For glms, as defined in the William's reference on ?cooks.distance, the 
Cook's distance is given by

C = (rs^2 * hh)/(p * (1 - hh))

where rs is the standardized *Pearson* residual, i.e.

rs = residuals(x, "pearson")/sqrt(summary(x)$dispersion * (1 - hh))

However plot 5 of plot.lm plots the standardized *deviance* residuals. 
Therefore, for non-Gaussian models, contours based on [1], even with 
"dispersion" = 1, will be on the wrong scale.

Since the deviance residuals are usually quite similar to the Pearson 
residuals, the contours may not be far off, but the difference accounts 
for the inconsistency reported in PR#9316.

The solution is simply to plot the standardised Pearson residuals 
instead, as in the attached patch.

Best regards,

Heather

-- 
Dr H Turner
Research Fellow
Dept. of Statistics
The University of Warwick
Coventry
CV4 7AL

Tel: 024 76575870
Fax: 024 76524532
Url: www.warwick.ac.uk/go/heatherturner

--------------030304040002000407020206
Content-Type: text/x-patch;
 name="plot.lm.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="plot.lm.diff"

Index: src/library/stats/R/plot.lm.R
===================================================================
--- src/library/stats/R/plot.lm.R	(revision 45649)
+++ src/library/stats/R/plot.lm.R	(working copy)
@@ -55,7 +55,7 @@
             else cooks.distance(x, sd = s, res = r)
 	}
     }
-    if (any(show[c(2:3,5)])) {
+    if (any(show[2:3])) {
 	ylab23 <- if (isGlm)
 	    "Std. deviance resid."
 	else "Standardized residuals"
@@ -167,7 +167,11 @@
 	    text.id(show.r, cook[show.r], show.r, adj.x=FALSE)
     }
     if (show[5]) {
-	ylim <- range(rs, na.rm = TRUE)
+        rps <- residuals(x, "pearson")/(s * sqrt(1 - hii))
+        ylab5 <- if (isGlm)
+            "Std. Pearson resid."
+        else "Standardized residuals"
+	ylim <- range(rps, na.rm = TRUE)
 	if (id.n > 0) {
 	    ylim <- extendrange(r= ylim, f = 0.08)
 	    show.r <- order(-cook)[iid]
@@ -197,15 +201,15 @@
                 facval[ord] <- facval
                 xx <- facval # for use in do.plot section.
 
-                plot(facval, rs, xlim = c(-1/2, sum((nlev-1) * ff) + 1/2),
+                plot(facval, rps, xlim = c(-1/2, sum((nlev-1) * ff) + 1/2),
                      ylim = ylim, xaxt = "n",
                      main = main, xlab = "Factor Level Combinations",
-                     ylab = ylab23, type = "n", ...)
+                     ylab = ylab5, type = "n", ...)
                 axis(1, at = ff[1]*(1:nlev[1] - 1/2) - 1/2,
                      labels= x$xlevels[[1]][order(sapply(split(yh,mf[,1]), mean))])
                 mtext(paste(facvars[1],":"), side = 1, line = 0.25, adj=-.05)
                 abline(v = ff[1]*(0:nlev[1]) - 1/2, col="gray", lty="F4")
-                panel(facval, rs, ...)
+                panel(facval, rps, ...)
                 abline(h = 0, lty = 3, col = "gray")
             }
 	    else { # no factors
@@ -221,21 +225,20 @@
             ## omit hatvalues of 1.
             xx[xx >= 1] <- NA
 
-            plot(xx, rs, xlim = c(0, max(xx, na.rm = TRUE)), ylim = ylim,
-                 main = main, xlab = "Leverage", ylab = ylab23, type = "n",
+            plot(xx, rps, xlim = c(0, max(xx, na.rm = TRUE)), ylim = ylim,
+                 main = main, xlab = "Leverage", ylab = ylab5, type = "n",
                  ...)
-            panel(xx, rs, ...)
+            panel(xx, rps, ...)
             abline(h = 0, v = 0, lty = 3, col = "gray")
             if (one.fig)
                 title(sub = sub.caption, ...)
             if(length(cook.levels)) {
-                dispersion <- if(isGlm) summary(x)$dispersion else 1
                 p <- length(coef(x))
                 usr <- par("usr")
                 hh <- seq.int(min(r.hat[1], r.hat[2]/100), usr[2],
                               length.out = 101)
                 for(crit in cook.levels) {
-                    cl.h <- sqrt(crit*p*(1-hh)/hh * dispersion)
+                    cl.h <- sqrt(crit*p*(1-hh)/hh)
                     lines(hh, cl.h, lty = 2, col = 2)
                     lines(hh,-cl.h, lty = 2, col = 2)
                 }
@@ -254,7 +257,7 @@
 	if (do.plot) {
 	    mtext(caption[5], 3, 0.25, cex = cex.caption)
 	    if (id.n > 0) {
-		y.id <- rs[show.r]
+		y.id <- rps[show.r]
 		y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3
 		text.id(xx[show.r], y.id, show.r)
 	    }

--------------030304040002000407020206--



More information about the R-devel mailing list