[Rd] Bias in R's random integers?

Duncan Murdoch murdoch.duncan at gmail.com
Thu Sep 20 17:15:04 CEST 2018


On 20/09/2018 6:59 AM, Ralf Stubner wrote:
> On 9/20/18 1:43 AM, Carl Boettiger wrote:
>> For a well-tested C algorithm, based on my reading of Lemire, the unbiased
>> "algorithm 3" in https://arxiv.org/abs/1805.10941 is part already of the C
>> standard library in OpenBSD and macOS (as arc4random_uniform), and in the
>> GNU standard library.  Lemire also provides C++ code in the appendix of his
>> piece for both this and the faster "nearly divisionless" algorithm.
>>
>> It would be excellent if any R core members were interested in considering
>> bindings to these algorithms as a patch, or might express expectations for
>> how that patch would have to operate (e.g. re Duncan's comment about
>> non-integer arguments to sample size).  Otherwise, an R package binding
>> seems like a good starting point, but I'm not the right volunteer.
> It is difficult to do this in a package, since R does not provide access
> to the random bits generated by the RNG. Only a float in (0,1) is
> available via unif_rand(). 

I believe it is safe to multiply the unif_rand() value by 2^32, and take 
the whole-number part as an unsigned 32-bit integer.  Depending on the 
RNG in use, that will give at least 25 random bits.  (The low-order bits 
are the questionable ones.  25 is just a guess, not a guarantee.)
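
At the R level the idea looks something like this (just a sketch of 
mine, with runif() standing in for the C-level unif_rand() and the 
object names made up for illustration):

   u <- runif(1000000)
   ## keep the result as a double: R's integer type is signed 32-bit
   bits32 <- floor(u * 2^32)   # whole-number part, in 0 .. 2^32 - 1
   table(bits32 %% 2)          # the low-order bit is the least reliable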

> However, if one is willing to use an external
> RNG, it is of course possible. After reading about Lemire's work [1], I
> had planned to integrate such an unbiased sampling scheme into the dqrng
> package, which I have now started. [2]
> 
> Using Duncan's example, the results look much better:
> 
>> library(dqrng)
>> m <- (2/5)*2^32
>> y <- dqsample(m, 1000000, replace = TRUE)
>> table(y %% 2)
> 
>       0      1
> 500252 499748

Another useful diagnostic is

   plot(density(y[y %% 2 == 0]))

Obviously that should give a more or less uniform density, but for 
values near m, the default sample() gives some nice pretty pictures of 
quite non-uniform densities.
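
For comparison, here is a sketch along the lines of my earlier example 
(not run here, so treat it as illustrative only):

   m <- (2/5) * 2^32
   y <- sample(m, 1000000, replace = TRUE)
   plot(density(y[y %% 2 == 0]))   # expect visible non-uniformity near m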

By the way, there are actually quite a few examples of very large m 
besides m = (2/5)*2^32 where the output of sample() is noticeably biased. 
You'll see problems in y %% 2 for any integer a > 1 with 
m = 2/(1 + 2a) * 2^32, problems in y %% 3 for m = 3/(1 + 3a) * 2^32 or 
m = 3/(2 + 3a) * 2^32, and so on.
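
For instance (again just a sketch, with an arbitrary choice of a):

   a <- 2
   m <- 3/(1 + 3*a) * 2^32                  # one of the problem cases
   y <- sample(m, 1000000, replace = TRUE)
   table(y %% 3)                            # expect a noticeable imbalance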

So perhaps I'm starting to be convinced that the default sample() should 
be fixed.

Duncan Murdoch


> 
> Currently I am taking the other interpretation of "truncated":
> 
>> table(dqsample(2.5, 1000000, replace = TRUE))
> 
>       0      1
> 499894 500106
> 
> I will adjust this to whatever is decided for base R.
> 
> 
> However, there is currently neither long vector nor weighted sampling
> support. And the performance without replacement is quite bad compared
> to R's algorithm with hashing.
> 
> cheerio
> ralf
> 
> [1] via http://www.pcg-random.org/posts/bounded-rands.html
> [2] https://github.com/daqana/dqrng/tree/feature/sample
> 
> 
> 


