[Rd] sum() and mean() for (ALTREP) integer sequences

Viechtbauer, Wolfgang (SP) wolfgang.viechtbauer at maastrichtuniversity.nl
Fri Sep 3 14:49:27 CEST 2021


>> Hi all,
>>
>> I am trying to understand the performance of functions applied to integer sequences. Consider the following:
>>
>> ### begin example ###
>>
>> library(lobstr)
>> library(microbenchmark)
>>
>> x <- sample(1e6)
>> obj_size(x)
>> # 4,000,048 B
>>
>> y <- 1:1e6
>> obj_size(y)
>> # 680 B
>>
>> # So we can see that 'y' uses ALTREP. These are, as expected, the same:
>>
>> sum(x)
>> # [1] 500000500000
>> sum(y)
>> # [1] 500000500000
>>
>> # For 'x', we have to go through the trouble of actually summing up 1e6 integers.
>> # For 'y', knowing its form, we really just need to do:
>>
>> 1e6*(1e6+1)/2
>> # [1] 500000500000
>>
>> # which should be a whole lot faster. And indeed, it is:
>>
>> microbenchmark(sum(x),sum(y))
>>
>> # Unit: nanoseconds
>> #    expr    min       lq      mean   median       uq    max neval cld
>> #  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
>> #  sum(y)    183    245.5    446.09    338.5    447.0   3233   100  a
>>
>> # Now what about mean()?
>>
>> mean(x)
>> # [1] 500000.5
>> mean(y)
>> # [1] 500000.5
>>
>> # which is the same as
>>
>> (1e6+1)/2
>> # [1] 500000.5
>>
>> # But this surprised me:
>>
>> microbenchmark(mean(x),mean(y))
>>
>> # Unit: microseconds
>> #     expr      min        lq     mean   median       uq      max neval cld
>> #  mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100  a
>> #  mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b
>>
>> ### end example ###
>>
>> So why is mean() on an ALTREP sequence slower when sum() is faster?
>>
>> And more generally, when using sum() on an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or generalized to sequences a:b -- (a+b)*(b-a+1)/2) for computing the sum? If so, why not (it seems) for mean()?
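
As a quick check of the closed-form expression in the question: for an ascending integer sequence a:b the sum is (a + b) * (b - a + 1) / 2, and this agrees with sum() on the corresponding (ALTREP) sequences. A minimal sketch (the helper name 'seq_sum' is purely illustrative, not part of base R):

# closed-form sum of the ascending integer sequence a:b
seq_sum <- function(a, b) (a + b) * (b - a + 1) / 2

seq_sum(1, 1e6) == sum(1:1e6)
# [1] TRUE
seq_sum(10, 1e6) == sum(10:1e6)
# [1] TRUE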
>
>The mean.default function looks like this:
>
>function (x, trim = 0, na.rm = FALSE, ...)
>{
>     if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
>         warning("argument is not numeric or logical: returning NA")
>         return(NA_real_)
>     }
>     if (na.rm)
>         x <- x[!is.na(x)]
>     if (!is.numeric(trim) || length(trim) != 1L)
>         stop("'trim' must be numeric of length one")
>     n <- length(x)
>     if (trim > 0 && n) {
>         if (is.complex(x))
>             stop("trimmed means are not defined for complex data")
>         if (anyNA(x))
>             return(NA_real_)
>         if (trim >= 0.5)
>             return(stats::median(x, na.rm = FALSE))
>         lo <- floor(n * trim) + 1
>         hi <- n + 1 - lo
>         x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
>     }
>     .Internal(mean(x))
>}
>
>So it does fixups for trimming and NA removal, then calls an internal
>function.  The internal function is the first part of do_summary, here:
>
>https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L541-L556
>
>It uses a separate function for the mean of each type.  The real_mean
>function here:
>
>https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L476-L515
>
>makes a big effort to avoid overflows.
>
>So I suspect the reason mean.default doesn't use sum(x)/length(x) at the
>end is that on a long vector sum(x) could overflow when mean(x) shouldn't.
>
>So why not take the ALTREP into account?  I suspect it's just too much
>trouble for a rare case.
>
>Duncan Murdoch
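
The overflow point can be illustrated with a small example using doubles: sum() must return its total as an ordinary double, so sum(x)/length(x) can overflow to Inf even when the mean itself is representable, whereas real_mean (linked above) divides its accumulator by n before converting back to double. The mean(x) result shown below assumes a build where the C long double accumulator has a wider exponent range than double (as on typical x86 builds); where it does not, mean(x) can overflow as well:

x <- rep(1e308, 10)
sum(x) / length(x)
# [1] Inf
mean(x)
# [1] 1e+308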

Hi Duncan,

Thanks for pointing me to the appropriate places in the C code. And ok, makes sense.

Best,
Wolfgang

