[Rd] Bias in R's random integers?

Fri Sep 21 15:15:44 CEST 2018

On 9/20/18 5:15 PM, Duncan Murdoch wrote:
> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
>> It is difficult to do this in a package, since R does not provide access
>> to the random bits generated by the RNG. Only a float in (0,1) is
>> available via unif_rand(). 
> 
> I believe it is safe to multiply the unif_rand() value by 2^32, and take
> the whole number part as an unsigned 32 bit integer.  Depending on the
> RNG in use, that will give at least 25 random bits.  (The low order bits
> are the questionable ones.  25 is just a guess, not a guarantee.)

Right, the RNGs in R produce no more than 32bits, so the conversion to a
double can be reverted. If we ignore those RNGs that produce less than
32bits for the moment, then the attached file contains a sample
implementation (without long vectors, weighted sampling or hashing). It
uses Rcpp for convenience, but I have tried to keep the C++ low.
Interesting results:

The results for "simple" sampling are the same.

> set.seed(42)

> sample.int(6, 10, replace = TRUE)
 [1] 6 6 2 5 4 4 5 1 4 5

> sample.int(100, 10)
 [1] 46 72 92 25 45 90 98 11 44 51

> set.seed(42)

> sample_int(6, 10, replace = TRUE)
 [1] 6 6 2 5 4 4 5 1 4 5

> sample_int(100, 10)
 [1] 46 72 92 25 45 90 98 11 44 51

But there is no bias with the alternative method:

> m <- ceiling((2/5)*2^32)

> set.seed(42)

> x <- sample.int(m, 1000000, replace = TRUE)

> table(x %% 2)

     0      1
467768 532232

> set.seed(42)

> y <- sample_int(m, 1000000, replace = TRUE)

> table(y %% 2)

     0      1
500586 499414

The differences are also visible when sampling only a few values from
'm' possible values:

> set.seed(42)

> sample.int(m, 6, replace = TRUE)
[1] 1571624817 1609883303  491583978 1426698159 1102510407  891800051

> set.seed(42)

> sample_int(m, 6, replace = TRUE)
[1]  491583978 1426698159 1102510407  891800051 1265449090  231355453

When sampling from 'm', performance is not so good since we often have
to get a second random number:

> bench::mark(orig = sample.int(m, 1000000, replace = TRUE),
+             new  = sample_int(m, 1000000, replace = TRUE),
+             check = FALSE)
# A tibble: 2 x 14
  expression     min    mean  median   max `itr/sec` mem_alloc  n_gc n_itr
  <chr>      <bch:t> <bch:t> <bch:t> <bch>     <dbl> <bch:byt> <dbl> <int>
1 orig        8.15ms  8.67ms  8.43ms  10ms     115.     3.82MB     4    52
2 new        25.21ms 25.58ms 25.45ms  27ms      39.1    3.82MB     2    18
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory
<list>,
#   time <list>, gc <list>

When sampling from fewer values, the difference is much less pronounced:

> bench::mark(orig = sample.int(6, 1000000, replace = TRUE),
+             new  = sample_int(6, 1000000, replace = TRUE),
+             check = FALSE)
# A tibble: 2 x 14
  expression     min    mean  median     max `itr/sec` mem_alloc  n_gc n_itr
  <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
1 orig        8.14ms  8.44ms  8.29ms  9.58ms     118.     3.82MB     4    54
2 new        11.13ms 11.66ms 11.23ms 12.98ms      85.8    3.82MB     3    39
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory
<list>,
#   time <list>, gc <list>

> Another useful diagnostic is
> 
>   plot(density(y[y %% 2 == 0]))
> 
> Obviously that should give a more or less uniform density, but for
> values near m, the default sample() gives some nice pretty pictures of
> quite non-uniform densities.

Indeed. Adding/subtracting numbers < 10 to/from 'm'  gives "interesting"
curves.

> By the way, there are actually quite a few examples of very large m
> besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)
> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> 3a)*2^32, etc.
> 
> So perhaps I'm starting to be convinced that the default sample() should
> be fixed.

I have the impression that Lemire's method gives the same results unless
it is correcting for the bias that exists in the current method. If that
is really the case, then the disruption should be rather minor. The
ability to fall back to the old behavior would still be useful, though.

cheerio
ralf

-- 
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: ralf.stubner using daqana.com

Sitz: Potsdam
Register: AG Potsdam HRB 27966 P
Ust.-IdNr.: DE300072622
Geschäftsführer: Prof. Dr. Dr. Karl-Kuno Kunze

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180921/10d5df07/attachment.sig>