[Rd] Suggestions to speed up median() and has.na()

Mon Apr 10 19:37:54 CEST 2006

Hi,

I've got two suggestions how to speed up median() about 50%.  For all
iterative methods calling median() in the loops this has a major
impact.  The second suggestion will apply to other methods too.

This is what the functions look like today:

> median
function (x, na.rm = FALSE)
{
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")
    if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)
    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        sort(x, partial = half)[half]
    }
    else {
        sum(sort(x, partial = c(half, half + 1))[c(half, half +
            1)])/2
    }
}
<environment: namespace:stats>

Suggestion 1:
Replace the sort() calls with the .Internal(psort(x, partial)).   This
will avoid unnecessary overhead, especially an expensive second check
for NAs using any(is.na(x)).  Simple benchmarking with

x <- rnorm(10e6)
system.time(median(x))/system.time(median2(x))

where median2() is the function with the above replacements, gives
about 20-25% speed up.

Suggestion 2:
Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
as soon as a NA value is detected.  In the best case it returns after
the first index with TRUE, in the worst case it returns after the last
index N with FALSE.  The cost for is.na(x) is always O(N), and any()
in the best case O(1) and in the worst case O(N) (if any() is
implemented as I hope).  An has.na() function would be very useful
elsewhere too.

An poor mans alternative to (2), is to have a third alternative to
'na.rm', say, NA, which indicates that we know that there are no NAs
in 'x'.

The original median() is approx 50% slower (naive benchmarking) than a
version with the above two improvements, if passing a large 'x' with
no NAs;

median2 <- function (x, na.rm = FALSE) {
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")

    if (is.na(na.rm)) {
    } else if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)

    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        .Internal(psort(x, half))[half]
    }
    else {
        sum(.Internal(psort(x, c(half, half + 1)))[c(half, half + 1)])/2
    }
}

x <- rnorm(10e5)
K <- 10
t0 <- system.time({
  for (kk in 1:K)
    y <- median(x);
})
print(t0)  # [1] 1.82 0.14 1.98   NA   NA
t1 <- system.time({
  for (kk in 1:K)
    y <- median2(x, na.rm=NA);
})
print(t1)  # [1] 1.25 0.06 1.34   NA   NA
print(t0/t1)  # [1] 1.456000 2.333333 1.477612       NA       NA

BTW, without having checked the source code, it looks like is.na() is
unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
a vector without NAs.  On the other hand, is.na(sum(x)) becomes
awfully slow if 'x' contains NAs.

/Henrik