[Rd] ks.test - The two-sample two-sided Kolmogorov-Smirnov test with ties (PR#13848)

Wed Jul 22 22:45:12 CEST 2009

Full_Name: Thomas Waterhouse
Version: 2.9.1
OS: OS X 10.5.7
Submission from: (NULL) (216.239.45.4)

ks.test uses a biased approximation to the p-value in the case of the two-sample
test with ties in the pooled data.  This has been justified in R in the past by
the argument that the KS test assumes a continuous distribution.  But the
two-sample test can be extended to arbitrary distributions by a combinatorial
argument (below), which I would like to see implemented in R.

The p-value of the test is simply the probability that a random assignment of
the pooled data to two sets of the sizes of the input data sets has a KS test
statistic at least as great as that of the input data.  The function psmirnov2x
in ks.c calculates this probability by enumerating all such assignments as
lattice walks, except that it doesn't know how to handle tied data points.

The correct procedure can be deduced by considering that steps in such lattice
walks represent steps along the x-axis in computing the empirical CDFs of the
two data sets.  The trick is to consider all occurrences of a repeated data
point together, when determining whether a path exceeds the input statistic or
not.  A FORTRAN implementation of this algorithm was published by Nikiforov
(http://www.ams.sunysb.edu/~nkfr/5PUBL.HTM), but it's not for the faint of
heart.

It would be straightforward to change psmirnov2x to include this logic, but I
found it easier simply to rewrite it (with comments this time!) as follows:

void
psmirnov2x(double *x, Sint *m, Sint *n, double *pool)
{
/* Inputs:  *x is the value of the KS test statistic (from 0.0 to 1.0)
            *m and *n are the lengths of each data set
            *pool is an array of length *m + *n formed by combining the two
            data sets and sorting in increasing order
   Output:  *x is unity minus the p-value of the test
   Notes:     The p-value of the two-sample KS test is the proportion of all
            assignments of the pooled data to sets of sizes *m and *n which
            have test statistic exceeding the test statistic of the actual
data.
              To calculate this, we interpret each assignment as a lattice walk
            from the origin to (*m, *n), where the i-th step represents the
i-th
            data point in the pooled data set.  A step to the right represents
            assigning a data point to the *m set, and a step up represents
            assigning a data point to the *n set.  The difference between the
            empirical CDFs of each set can be computed at each step, and the
            KS test statistic is simply the maximum of this value along a given
            walk.
              The p-value is obtained by counting the number of walks that do
            not exceed the input value (since this is easier than counting the
            number that do) and dividing by the total number of walks; this is
            what we return to the caller.
              If the same value occurs multiple times in the pooled data, we
            need to count all multiplicites at the same time when computing the
            empirical CDFs.  In terms of the lattice walk, this means we look
            only the final occurence of each value when determining whether or
            not a walk exceeds the input statistic.
 */
    double md, nd, q, *u, w;
    Sint i, j;

    if (*x == 0.0) {
        return 1.0;
    }

    md = (double) (*m);
    nd = (double) (*n);
    q = floor(*x * md * nd - 1e-7) / (md * nd);  /* just *x minus an epsilon */
    u = (double *) R_alloc(*n + 1, sizeof(double));

    for (i = 0; i <= *m; i++) {
        for (j = 0; j <= *n; j++) {
            /* At each step, rescale by w := j/(j+*m); this is equivalent to
               dividing by C(*m+*n, *m) at the end, but without the risk of
               overflow */
            w = (double)(j) / (double)(j + *m);
            if (i + j == 0) {
                /* Start with 1 walk from the origin to the origin */
                u[j] = 1.0;
            } else if (i + j < *m + *n &&
                       pool[i + j - 1] != pool[i + j] &&
                       fabs(i / md - j / nd) > q) {
                /* This walk meets or exceeds *x, so don't count it */
                u[j] = 0.0;
            } else if (i == 0) {
                /* The *n-edge:  The number of walks to (0, j) is the same as
                   the number of walks to (0, j-1) */
                u[j] = u[j - 1] * w;
            } else if (j == 0) {
                /* The *m-edge:  The number of walks to (i, 0) is the same as
                   the number of walks to (i-1, 0) */
                /* no-op */
            } else {
                /* The general case:  The number of walks to (i, j) is the
                   number of walks to (i-1, j) plus the number of walks to
                   (i, j-1) */
                u[j] = u[j] + u[j - 1] * w;
            }
        }
    }
    /* Return the proportion of walks that do not exceed the input statistic:
*/
    *x = u[*n];
}

ks.test.R must also be changed as per the following diff, which includes a
corrected calculation of the test statistic:
51d50
<         TIES <- FALSE
54,59c53,57
<         z <- cumsum(ifelse(order(w) <= n.x, 1 / n.x, - 1 / n.y))
<         if(length(unique(w)) < (n.x + n.y)) {
<             warning("cannot compute correct p-values with ties")
<             z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]
<             TIES <- TRUE
<         }
---
> 		steps = which(diff(sort(w)) != 0)
> 		if(length(steps) > 0)
> 			z <- cumsum(ifelse(order(w) <= n.x, 1 / n.x, - 1 / n.y))[steps]
> 		else
> 			z <- c(0)
68c66
<         if(exact && (alternative == "two.sided") && !TIES)
---
>         if(exact && (alternative == "two.sided"))
72a71
> 						   sort(w),

Finally, the changed signature of psmirnov2x needs to be reflected in ctest.h
and init.c.

I have built and tested the above implementation against 20 pairs of data sets
for which I calculated the test statistics and p-values by brute force or
Nikiforov's algorithm.  I can provide these test sets if desired.