[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Martin Maechler maechler at stat.math.ethz.ch
Tue May 30 09:33:10 CEST 2017


>>>>> Serguei Sokol <sokol at insa-toulouse.fr>
>>>>>     on Mon, 29 May 2017 15:28:12 +0200 writes:

    > Sorry, I have seen it too late that we had different tab
    > width in the original file and my editor.  Here is the
    > patch with all white spaces instead of mixing tabs and
    > white spaces.

thank you - it still gives quite a few unnecessary (i.e. "white space")
 diffs with the source code, but that's not the problem :

The patch leads to correct results for the simple
 (1:k, 1:k)  data sets (for all k).

However, even after the patch,
The example from the SO post differs from the result of Richie
Cotton's function...(even though that function had a silly bug
	 in step 1, the bug has not been "kicking" for the example):

Here's a fixed-up version of the pure R function and
the example and some comments :


## From Stackoverflow
## http://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line
## median_median_line by Richie Cotton (July 12 2010, last edited at 13:49)
##
## Shorter variable names, fixed bug in step 1, added 'plot.' option: Martin Maechler, May 2017
MMline <- function(x, y, data, plot. = FALSE)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data)
    y <- eval(substitute(y), data)
  }
  stopifnot((n <- length(x)) == length(y), n >= 2)

  ## Step 1
  n.3 <- n %/% 3L
  groups <- rep(1:3, times = switch((n %% 3) + 1L,
     n.3,
     c(n.3, n.3 + 1L, n.3),
     c(n.3 + 1L, n.3, n.3 + 1L)
  ))
  groups <- lapply(list(groups), as.factor) # (gain a bit in tapply())

  ## Step 2
  ## sort *both*  x & y "along x":
  x <- sort.int(x, index.return = TRUE)
  y <- y[x$ix]
  x <- x$x
  if(plot.) plot(x,y)

  ## Step 3
  m_x <- tapply(x, groups, median)
  m_y <- tapply(y, groups, median)
  if(plot.) {
      points(m_x, m_y, cex=2, pch=3, col="red")
      segments(m_x[1],m_y[1], m_x[3],m_y[3], col="red")
  }
  ## Step 4
  R <- if(n == 2) 2L else 3L
  slope <- (m_y[[R]] - m_y[[1]]) / (m_x[[R]] - m_x[[1]])
  intercept <- m_y[[1]] - slope * m_x[[1]]

  ## Step 5
  mid_y <- intercept + slope * m_x[[2]]
  intercept <- intercept + (m_y[[2]] - mid_y) / 3
  if(plot.) abline(a = intercept, b = slope, col=adjustcolor("midnight blue", .5), lwd=2)
  c(intercept = intercept, slope = slope)
}

## To test it, here's the second example from that page:

dfr <- data.frame(
      time = c( .16,  .24,  .25,  .30,  .30,    .32,  .36,  .36,  .50,  .50,
                .57,  .61,  .61,  .68,  .72,    .72,  .83,  .88,  .89),
  distance = c(12.1, 29.8, 32.7, 42.8, 44.2,   55.8, 63.5, 65.1,124.6,129.7,
              150.2,182.2,189.4,220.4,250.4,  261.0,334.5,375.5,399.1))

MMline(time, distance, dfr, plot.=TRUE)
## intercept    slope
##   -113.6     520.0
par(new=TRUE)# should plot identically!
MMline(time, distance, dfr[sample.int(nrow(dfr)), ], plot.=TRUE)

## Note the slightly odd way of specifying the groups. The instructions are
## quite picky about how you define group sizes, so the more obvious method
## of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.

## edited Jul 12 '10 at 13:49 / answered Jul 12 '10 at 13:36
## Richie Cotton

## And someone remarked  that R's  line() did not give the same:

with(dfr, line(time, distance))
## ...
## Coefficients:
## [1]  -108.8   505.2
abline(-108.8, 505.2, col="blue") ##=> this one is wrong

## MM:
(cfs <- t(sapply(setNames(,2:50), function(k) {x <- 1:k; MMline(x,x)})))
##    intercept slope
## 2          0     1 # (special case fixed)
## 3          0     1
## 4          0     1
## 5          0     1
##   ............
## 49         0     1
## 50         0     1

## "Harder": (sort ing!)
cf2 <- t(sapply(setNames(,2:50),
                function(k) {x <- sample.int(k); MMline(x, -10*x)}))
stopifnot(abs(cf2[,"intercept"] -   0) < 1e-12,
          abs(cf2[,    "slope"] - -10) < 1e-12)



More information about the R-devel mailing list