[Rd] (PR#8087) NAs by integer overflow in Spearman's test p-value

Tue Aug 23 19:19:39 CEST 2005

There is an even simpler way: someone wrote n*(n^2-1) as n*(n-1)*(n+1)
and caused the problem.

Your superfluous semicolons do definitely make your code harder to read.

On Tue, 23 Aug 2005 jtk at cmp.uea.ac.uk wrote:

> Full_Name: Jan T. Kim
> Version: 2.1.0 (and better)
> OS: Linux
> Submission from: (NULL) (139.222.3.229)
>
>
> The p value in Spearman's test is NA if the length of x exceeds 46340, due to
> an integer overflow, occurring if length(n) > sqrt(2^31):
>
>    > n <- 46341;
>    > set.seed(1);
>    > x <- runif(n);
>    > y <- runif(n);
>    > cor.test(x, y, method = "spearman");
>
>            Spearman's rank correlation rho
>
>    data:  x and y
>    S = 1.654965e+13, p-value = NA
>    alternative hypothesis: true rho is not equal to 0
>    sample estimates:
>            rho
>    0.002199426
>
>    Warning message:
>    NAs produced by integer overflow in: n * n
>
> The integer overflow occurs in src/library/stats/R/cor.test.R, and it can be
> fixed
> by converting n to double appropriately (see *** fix label ***, lines 110
> onwards
> are shown):
>
>                ## Use the test statistic S = sum(rank(x) - rank(y))^2
>                ## and AS 89 for obtaining better p-values than via the
>                ## simple normal approximation.
>                ## In the case of no ties, S = (1-rho) * (n^3-n)/6.
>                pspearman <- function(q, n, lower.tail = TRUE) {
>                    if(n <= 1290) # n*(n^2 - 1) does not overflow
>                        .C("prho",
>                           as.integer(n),
>                           as.double(q + 1),
>                           p = double(1),
>                           integer(1),
>                           as.logical(lower.tail),
>                           PACKAGE = "stats")$p
>                    else { # for large n: aymptotic t_{n-2}
>                        n <- as.double(n);  # *** fix ***
>                        r <- 1 - 6 * q / (n*(n*n - 1))
>                        pt(r / sqrt((1 - r^2)/(n-2)), df = n-2,
>                           lower.tail= !lower.tail)
>                    }
>                }
>                q <- round((n^3 - n) * (1 - r) / 6)
>                STATISTIC <- c(S = q)
>                PVAL <-
>                    switch(alternative,
>                           "two.sided" = {
>                               p <- if(q > (n^3 - n) / 6)
>                                   pspearman(q - 1, n, lower.tail = FALSE)
>                               else
>                                   pspearman(q, n, lower.tail = TRUE)
>                               min(2 * p, 1)
>                           },
>                           "greater" = pspearman(q, n, lower.tail = TRUE),
>                           "less" = pspearman(q - 1, n, lower.tail = FALSE))
>                if(TIES)
>                    warning("p-values may be incorrect due to ties")
>
> Inserting the typecast only in the pspearman function is a minimally invasive
> fix -- alternatively, replacing at line 17
>
>    n <- length(x)
>
> with
>
>    n <- as.double(length(x))
>
> achieves the same fix and may take care of other unnecessary integer overflows,
> but it may also introduce new problems e.g. in .C calls etc.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595