[Rd] sum() and mean() for (ALTREP) integer sequences

Thu Sep 2 21:54:18 CEST 2021

Hello,

Please, find a long response below.

== difference between mean(x) and sum(x)/length(x) ==

At the core, mean(x) and sum(x)/length(x) works very differently for real numbers.

Mean is more accurate when applied to a vector with a small variance but a very high mean, especially on platforms without extended precision numbers (i.e. long double is 64 bits rather than 80 bits).

On a Windows 10 computer (with 80 bits long double):

k=1e6
base=2^51
realmean = mean(1:k)
sqa=(base+1):(base+k) # with ALTREP
sq=(base+1):(base+k)+0.0 # without ALTREP
mean(sq) - base - realmean  # correctly returns zero
sum(sq)/k - base - realmean # incorrectly returns -0.5
sum(sqa)/k - base - realmean # correctly returns zero

On a GNU/Linux x86_64 computer with extended precision floating point disabled, the difference is worse:
mean(sq) - base - realmean  # correctly returns zero
sum(sq)/k - base - realmean # incorrectly returns -1180
sum(sqa)/k - base - realmean # correctly returns zero

Therefore (without ALTREP) sum can be inaccurate, due to floating point rounding errors.

The good accuracy of mean() is due to a two-passes algorithm of the real_mean() function in src/main/summary.c.

The algorithm can be summarized by the following equivalent R code:
badmean=function(v) {sum(v)/length(v)}
goodmean=function(v) {center = badmean(v); center + badmean(v - center)}
goodmean(sq) - base - realmean # correctly returns zero

== should mean() call ALTINTEGER_SUM ? ==
As you noticed in the examples above, sum() is much more accurate with ALTREPs than with the naive algorithm, because there are no cumulative rounding errors.

Moreover, if we focus on INTSXP, the maximum possible value is lower : INT_MAX = 2^31-1

The sum of a large sequence (e.g. 1:(2^31-1)) can still overflow the exact integer range (0 to 2^53) of an FP64, and the division does not always round back to the correct value.

bad = 1:189812531L
mean(bad) - sum(bad)/length(bad) # returns -1.5e-08 on a platform with FP80
mean(bad) == 94906266 # correct value (the actual result is an integer)

The implementation of mean() on INTSXP do not use the two-passes trick, but relies on a LDOUBLE (typically FP80 on the x86 platform) that is large enough to avoid rounding bugs even with huge integer sequences.

Unfortunately the ALTINTEGER_SUM interface returns at best a FP64, and so, would not return the FP64 closest to the actual mean for some sequences.

Adding a MEAN function to the ALTREP interface could solve the problem.

== can mean performance be improved easily ? ==

The mean() implementation for integers, supports ALTREPs with poor iteration performances, using the slow INTEGER_ELT() macro.

Moreover, it converts each integer to LDOUBLE, which is slow.

It can be improved using ITERATE_BY_REGION0 and using a 64 bits integer (if available) that cannot overflow on small batches (size = 512).

# before patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8+0L
system.time(mean(x)) # user 1.33 second
system.time(mean(y)) # user 0.32 second

# after patching (on Ubuntu x86_64 Silvermont Celeron J1900)
# after patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8+0L
system.time(mean(x)) # user 0.29 second # more than x4 faster
system.time(mean(y)) # user 0.18 second # x 1.7 faster !

(patch attached to this message)

The patch is not optimal. It should ideally use isum(), and risum() but these functions are a mess, needing refactoring. For instance, they take a 'call' argument and may display a warning message related to the sum() function.

--
Sincerely
Andre GILLIBERT
________________________________
De : R-devel <r-devel-bounces using r-project.org> de la part de Viechtbauer, Wolfgang (SP) <wolfgang.viechtbauer using maastrichtuniversity.nl>
Envoyé : jeudi 2 septembre 2021 12:55:03
À : r-devel using r-project.org
Objet : [Rd] sum() and mean() for (ALTREP) integer sequences

ATTENTION: Cet e-mail provient d’une adresse mail extérieure au CHU de Rouen. Ne cliquez pas sur les liens ou n'ouvrez pas les pièces jointes à moins de connaître l'expéditeur et de savoir que le contenu est sûr. En cas de doute, transférer le mail à « DSI, Sécurité » pour analyse. Merci de votre vigilance

Hi all,

I am trying to understand the performance of functions applied to integer sequences. Consider the following:

### begin example ###

library(lobstr)
library(microbenchmark)

x <- sample(1e6)
obj_size(x)
# 4,000,048 B

y <- 1:1e6
obj_size(y)
# 680 B

# So we can see that 'y' uses ALTREP. These are, as expected, the same:

sum(x)
# [1] 500000500000
sum(y)
# [1] 500000500000

# For 'x', we have to go through the trouble of actually summing up 1e6 integers.
# For 'y', knowing its form, we really just need to do:

1e6*(1e6+1)/2
# [1] 500000500000

# which should be a whole lot faster. And indeed, it is:

microbenchmark(sum(x),sum(y))

# Unit: nanoseconds
#    expr    min       lq      mean   median       uq    max neval cld
#  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
#  sum(y)    183    245.5    446.09    338.5    447.0   3233   100  a

# Now what about mean()?

mean(x)
# [1] 500000.5
mean(y)
# [1] 500000.5

# which is the same as

(1e6+1)/2
# [1] 500000.5

# But this surprised me:

microbenchmark(mean(x),mean(y))

# Unit: microseconds
#     expr      min        lq     mean   median       uq      max neval cld
#  mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100  a
#  mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b

### end example ###

So why is mean() on an ALTREP sequence slower when sum() is faster?

And more generally, when using sum() on an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or generalized to sequences a:b -- (a+b)*(b-a+1)/2) for computing the sum? If so, why not (it seems) for mean()?

Best,
Wolfgang

______________________________________________
R-devel using r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel