[Rd] bug in cor.test(method = "spearman")

Petr Savicky savicky at cs.cas.cz
Mon Jan 26 17:16:00 CET 2009


Hi All:

The p-value of Spearman's rank correlation test is calculated using algorithm
AS 89. However, the way how AS 89 is used incures error, which may be an order
of magnitude larger than the error of the original algorithm.

In case of no ties AS 89 expects an even value of the statistic
  S = sum((rank(x)-rank(y))^2)
Clearly, S is even, since there is an even number of odd differences.
We will also need the fact that S ranges between 0 and 2*n*(n^2 - 1)/6,
where n = length(x) = length(y).

In some situations, R calls AS 89 with an odd value of S. Namely,
if S < n*(n^2 - 1)/6 (this corresponds to a positive correlation), then
the R call of AS 89 uses as input for computing the lower tail the odd
value S+1 instead of the correct S+2 (see the explanation below why this
makes a difference). For values S > n*(n^2 - 1)/6 (negative correlation),
the R call uses the upper tail of the distribution with input S, which
is correct.

The fact that in case of no ties AS 89 expects an even S may be seen in
the C code. The file R-patched/src/library/stats/src/prho.c, lines 90 - 93
contains the following comment ("is" is the parametr containing the value of S)
    /* NOT rounding to even anymore:  with ties, S, may even be non-integer!
     * js = *is;
     * if(fmod(js, 2.) != 0.) ++js;
     */
Older versions of the algorithm (the original fortran code is available
at http://lib.stat.cmu.edu/apstat/89) performed a correction by increasing a
possible odd value of "is" by one to an even number. Since now the correction
is removed, the calling program should guarantee an even value of S in case of
no ties.

The original AS 89 calculates only the upper tail. R uses a modification, where
the lower tail is computed in the C code as the complement of the upper tail.
Hence, in order to get P(S <= s), we can call AS 89 with the input s+2 and ask
for the complement of the upper tail, since
  P(S <= s) = P(S < s+2) = 1 - P(S >= s+2),
where S denotes the random variable and s is an even number.

For the true distribution, we could equivalently use
  P(S <= s) = P(S < s+1) = 1 - P(S >= s+1),
since P(S >= s+1) = P(S >= s+2).

However, the approximation is a continuous function and its value in s+1 is
roughly a linear interpolation between its values in s and s+2, so it is usually
different from its value in s+2. Since the approximation is done under the
assumption that the input is even, the correct approximation is P(S >= s+2).

The following table contains the maximum absolute error of the original
algorithm and of the R implementation for n = 12, ... ,22. For each n, the
maximum error is computed over those values of S which satisfy S < n*(n^2 - 1)/6
(positive correlation) and for which the correct p-value of the two-sided
test is in the range [0.01, 0.1]. 

  n    err.orig   err.R      err.R/err.orig

  12   3.83e-04   2.75e-03    7.18
  13   2.24e-04   2.10e-03    9.39
  14   1.43e-04   1.75e-03   12.24
  15   9.52e-05   1.46e-03   15.32
  16   6.85e-05   1.26e-03   18.45
  17   5.18e-05   1.08e-03   20.89
  18   4.07e-05   9.40e-04   23.09
  19   3.28e-05   8.11e-04   24.73
  20   2.66e-05   7.11e-04   26.73
  21   2.16e-05   6.31e-04   29.22
  22   1.77e-05   5.62e-04   31.77

On the side of negative correlation (S > n*(n^2 - 1)/6), there is no difference
between the original algorithm and its R implementation. The table of the
maximum errors for the same range of p-values on the side of negative correlation
is as follows

  n    err.orig   err.R      err.R/err.orig

  12   3.83e-04   3.83e-04   1
  13   2.24e-04   2.24e-04   1
  14   1.43e-04   1.43e-04   1
  15   9.52e-05   9.52e-05   1
  16   6.85e-05   6.85e-05   1
  17   5.18e-05   5.18e-05   1
  18   4.07e-05   4.07e-05   1
  19   3.28e-05   3.28e-05   1
  20   2.66e-05   2.66e-05   1
  21   2.16e-05   2.16e-05   1
  22   1.77e-05   1.77e-05   1

The value n = 22 is the largest n, for which I have the exact probabilities.

A patch correcting the problem follows. It is indented with 2 spaces. An intact
version of the patch is in an attachment.

  diff --minimal -U 4 -r R-patched/src/library/stats/R/cor.test.R R-cor.test/src/library/stats/R/cor.test.R
  --- R-patched/src/library/stats/R/cor.test.R	2008-10-21 10:29:37.000000000 +0200
  +++ R-cor.test/src/library/stats/R/cor.test.R	2009-01-26 16:58:59.123291744 +0100
  @@ -151,9 +151,9 @@
                   pspearman <- function(q, n, lower.tail = TRUE) {
                       if(n <= 1290 && exact) # n*(n^2 - 1) does not overflow
                           .C("prho",
                              as.integer(n),
  -                           as.double(round(q) + lower.tail),
  +                           as.double(round(q) + 2*lower.tail),
                              p = double(1),
                              integer(1),
                              as.logical(lower.tail),
                              PACKAGE = "stats")$p

Petr.

-------------- next part --------------
diff --minimal -U 4 -r R-patched/src/library/stats/R/cor.test.R R-cor.test/src/library/stats/R/cor.test.R
--- R-patched/src/library/stats/R/cor.test.R	2008-10-21 10:29:37.000000000 +0200
+++ R-cor.test/src/library/stats/R/cor.test.R	2009-01-26 16:58:59.123291744 +0100
@@ -151,9 +151,9 @@
                 pspearman <- function(q, n, lower.tail = TRUE) {
                     if(n <= 1290 && exact) # n*(n^2 - 1) does not overflow
                         .C("prho",
                            as.integer(n),
-                           as.double(round(q) + lower.tail),
+                           as.double(round(q) + 2*lower.tail),
                            p = double(1),
                            integer(1),
                            as.logical(lower.tail),
                            PACKAGE = "stats")$p


More information about the R-devel mailing list