[Rd] Bias in R's random integers?

Ben Bolker bbolker @ending from gm@il@com
Thu Sep 20 01:17:30 CEST 2018


  A quick point of order here: arguing with Duncan in this forum is
helpful to expose ideas, but probably neither side will convince the
other; eventually, if you want this adopted in core R, you'll need to
convince an R-core member to pursue this fix.

  In the meantime, a good, well-tested implementation in a
user-contributed package (presumably written in C for speed) would be
enormously useful.  Volunteers ... ?



On 2018-09-19 04:19 PM, Duncan Murdoch wrote:
> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
>> Hi Duncan--
>>
>> Nice simulation!
>>
>> The absolute difference in probabilities is small, but the maximum
>> relative difference grows from something negligible to almost 2 as m
>> approaches 2**31.
>>
>> Because the L_1 distance between the uniform distribution on {1, ...,
>> m} and what you actually get is large, there have to be test functions
>> whose expectations are quite different under the two distributions. 
> 
> That is a mathematically true statement, but I suspect it is not very
> relevant.  Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.  Remember Von Neumann's "state of sin" quote.  The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
> 
> The practical question is whether such a function is likely to arise in
> practice or not.
> 
>> Whether those correspond to commonly used statistics or not, I have no
>> idea.
> 
> I am pretty confident that this bug rarely matters.
> 
>> Regarding backwards compatibility: as a user, I'd rather the default
>> sample() do the best possible thing, and take an extra step to use
>> something like sample(..., legacy=TRUE) if I want to reproduce old
>> results.
> 
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.  Then the rejection sampling approach
> would need to be quite a bit more complicated.
> 
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().  But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
> 
> Duncan Murdoch
> 
>>
>> Regards,
>> Philip
>>
>> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
>> <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>> wrote:
>>
>>     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>>      > No, the 2nd call only happens when m > 2**31. Here's the code:
>>
>>     Yes, you're right. Sorry!
>>
>>     So the ratio really does come close to 2.  However, the difference in
>>     probabilities between outcomes is still at most 2^-32 when m is less
>>     than that cutoff.  That's not feasible to detect; the only detectable
>>     difference would happen if some event was constructed to hold an
>>     abundance of outcomes with especially low (or especially high)
>>     probability.
>>
>>     As I said in my original post, it's probably not hard to construct
>> such
>>     a thing, but as I've said more recently, it probably wouldn't
>> happen by
>>     chance.  Here's one attempt to do it:
>>
>>     Call the values from unif_rand() "the unif_rand() outcomes".  Call
>> the
>>     values from sample() the sample outcomes.
>>
>>     It would be easiest to see the error if half of the sample() outcomes
>>     used two unif_rand() outcomes, and half used just one.  That would
>> mean
>>     m should be (2/3) * 2^32, but that's too big and would trigger the
>>     other
>>     version.
>>
>>     So how about half use 2 unif_rands(), and half use 3?  That means m =
>>     (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
>>     would
>>     alternate between the two possibilities, so our event could be even
>>     versus odd outcomes.
>>
>>     Let's try it:
>>
>>       > m <- (2/5)*2^32
>>       > m > 2^31
>>     [1] FALSE
>>       > x <- sample(m, 1000000, replace = TRUE)
>>       > table(x %% 2)
>>
>>            0      1
>>     399850 600150
>>
>>     Since m is an even number, the true proportions of evens and odds
>>     should
>>     be exactly 0.5.  That's some pretty strong evidence of the bug in the
>>     generator.  (Note that the ratio of the observed probabilities is
>> about
>>     1.5, so I may not be the first person to have done this.)
>>
>>     I'm still not convinced that there has ever been a simulation run
>> with
>>     detectable bias compared to Monte Carlo error unless it (like this
>> one)
>>     was designed specifically to show the problem.
>>
>>     Duncan Murdoch
>>
>>      >
>>      > (RNG.c, lines 793ff)
>>      >
>>      > double R_unif_index(double dn)
>>      > {
>>      >      double cut = INT_MAX;
>>      >
>>      >      switch(RNG_kind) {
>>      >      case KNUTH_TAOCP:
>>      >      case USER_UNIF:
>>      >      case KNUTH_TAOCP2:
>>      > cut = 33554431.0; /* 2^25 - 1 */
>>      > break;
>>      >      default:
>>      > break;
>>      >     }
>>      >
>>      >      double u = dn > cut ? ru() : unif_rand();
>>      >      return floor(dn * u);
>>      > }
>>      >
>>      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>>     <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>
>>      > <mailto:murdoch.duncan using gmail.com
>>     <mailto:murdoch.duncan using gmail.com>>> wrote:
>>      >
>>      >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>>      >      > The 53 bits only encode at most 2^{32} possible values,
>>     because the
>>      >      > source of the float is the output of a 32-bit PRNG (the
>>     obsolete
>>      >     version
>>      >      > of MT). 53 bits isn't the relevant number here.
>>      >
>>      >     No, two calls to unif_rand() are used.  There are two 32 bit
>>     values,
>>      >     but
>>      >     some of the bits are thrown away.
>>      >
>>      >     Duncan Murdoch
>>      >
>>      >      >
>>      >      > The selection ratios can get close to 2. Computer
>> scientists
>>      >     don't do it
>>      >      > the way R does, for a reason.
>>      >      >
>>      >      > Regards,
>>      >      > Philip
>>      >      >
>>      >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>>      >     <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>
>>     <mailto:murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>>
>>      >      > <mailto:murdoch.duncan using gmail.com
>>     <mailto:murdoch.duncan using gmail.com>
>>      >     <mailto:murdoch.duncan using gmail.com
>>     <mailto:murdoch.duncan using gmail.com>>>> wrote:
>>      >      >
>>      >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>>      >      >      > El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
>>      >      >      > (<murdoch.duncan using gmail.com
>>     <mailto:murdoch.duncan using gmail.com>
>>      >     <mailto:murdoch.duncan using gmail.com
>>     <mailto:murdoch.duncan using gmail.com>> <mailto:murdoch.duncan using gmail.com
>>     <mailto:murdoch.duncan using gmail.com>
>>      >     <mailto:murdoch.duncan using gmail.com
>>     <mailto:murdoch.duncan using gmail.com>>>>)
>>      >      >     escribió:
>>      >      >      >>
>>      >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>>      >      >      >>> Dear list,
>>      >      >      >>>
>>      >      >      >>> It looks to me that R samples random integers
>>     using an
>>      >      >     intuitive but biased
>>      >      >      >>> algorithm by going from a random number on
>> [0,1) from
>>      >     the PRNG
>>      >      >     to a random
>>      >      >      >>> integer, e.g.
>>      >      >      >>>
>>      >      >
>>     https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>>      >      >      >>>
>>      >      >      >>> Many other languages use various rejection
>> sampling
>>      >     approaches
>>      >      >     which
>>      >      >      >>> provide an unbiased method for sampling, such as
>>     in Go,
>>      >     python,
>>      >      >     and others
>>      >      >      >>> described here:
>> https://arxiv.org/abs/1805.10941 (I
>>      >     believe the
>>      >      >     biased
>>      >      >      >>> algorithm currently used in R is also described
>>     there).  I'm
>>      >      >     not an expert
>>      >      >      >>> in this area, but does it make sense for the R to
>>     adopt
>>      >     one of
>>      >      >     the unbiased
>>      >      >      >>> random sample algorithms outlined there and used
>>     in other
>>      >      >     languages?  Would
>>      >      >      >>> a patch providing such an algorithm be welcome?
>> What
>>      >     concerns
>>      >      >     would need to
>>      >      >      >>> be addressed first?
>>      >      >      >>>
>>      >      >      >>> I believe this issue was also raised by Killie &
>>     Philip in
>>      >      >      >>>
>>      > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
>>      >      >     more
>>      >      >      >>> recently in
>>      >      >      >>>
>>      >      >
>>      >
>>     https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
>>    
>> <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>>      >       
>>  <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>>      >      >
>>      >         
>>  <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
>>      >      >      >>> pointing to the python implementation for
>> comparison:
>>      >      >      >>>
>>      >      >
>>      >
>>    
>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>>
>>      >      >      >>
>>      >      >      >> I think the analyses are correct, but I doubt if a
>>     change
>>      >     to the
>>      >      >     default
>>      >      >      >> is likely to be accepted as it would make it more
>>      >     difficult to
>>      >      >     reproduce
>>      >      >      >> older results.
>>      >      >      >>
>>      >      >      >> On the other hand, a contribution of a new
>>     function like
>>      >      >     sample() but
>>      >      >      >> not suffering from the bias would be good.  The
>>     normal way to
>>      >      >     make such
>>      >      >      >> a contribution is in a user contributed package.
>>      >      >      >>
>>      >      >      >> By the way, R code illustrating the bias is
>>     probably not very
>>      >      >     hard to
>>      >      >      >> put together.  I believe the bias manifests
>> itself in
>>      >     sample()
>>      >      >     producing
>>      >      >      >> values with two different probabilities (instead
>>     of all equal
>>      >      >      >> probabilities).  Those may differ by as much as
>>     one part in
>>      >      >     2^32.  It's
>>      >      >      >
>>      >      >      > According to Kellie and Philip, in the attachment
>>     of the
>>      >     thread
>>      >      >      > referenced by Carl, "The maximum ratio of selection
>>      >     probabilities can
>>      >      >      > get as large as 1.5 if n is just below 2^31".
>>      >      >
>>      >      >     Sorry, I didn't write very well.  I meant to say
>> that the
>>      >     difference in
>>      >      >     probabilities would be 2^-32, not that the ratio of
>>      >     probabilities would
>>      >      >     be 1 + 2^-32.
>>      >      >
>>      >      >     By the way, I don't see the statement giving the
>> ratio as
>>      >     1.5, but
>>      >      >     maybe
>>      >      >     I was looking in the wrong place.  In Theorem 1 of the
>>     paper
>>      >     I was
>>      >      >     looking in the ratio was "1 + m 2^{-w + 1}".  In that
>>     formula
>>      >     m is your
>>      >      >     n.  If it is near 2^31, R uses w = 57 random bits, so
>>     the ratio
>>      >      >     would be
>>      >      >     very, very small (one part in 2^25).
>>      >      >
>>      >      >     The worst case for R would happen when m  is just below
>>      >     2^25, where w
>>      >      >     is at least 31 for the default generators.  In that
>>     case the
>>      >     ratio
>>      >      >     could
>>      >      >     be about 1.03.
>>      >      >
>>      >      >     Duncan Murdoch
>>      >      >
>>      >      >
>>      >      >
>>      >      > --
>>      >      > Philip B. Stark | Associate Dean, Mathematical and Physical
>>      >     Sciences |
>>      >      > Professor,  Department of Statistics |
>>      >      > University of California
>>      >      > Berkeley, CA 94720-3860 | 510-394-5077 |
>>      > statistics.berkeley.edu/~stark
>>     <http://statistics.berkeley.edu/%7Estark>
>>      >     <http://statistics.berkeley.edu/%7Estark>
>>      >      > <http://statistics.berkeley.edu/%7Estark> |
>>      >      > @philipbstark
>>      >      >
>>      >
>>      >
>>      >
>>      > --
>>      > Philip B. Stark | Associate Dean, Mathematical and Physical
>>     Sciences |
>>      > Professor,  Department of Statistics |
>>      > University of California
>>      > Berkeley, CA 94720-3860 | 510-394-5077 |
>>     statistics.berkeley.edu/~stark
>>     <http://statistics.berkeley.edu/%7Estark>
>>      > <http://statistics.berkeley.edu/%7Estark> |
>>      > @philipbstark
>>      >
>>
>>
>>
>> -- 
>> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>> Professor,  Department of Statistics |
>> University of California
>> Berkeley, CA 94720-3860 | 510-394-5077 |
>> statistics.berkeley.edu/~stark
>> <http://statistics.berkeley.edu/%7Estark> |
>> @philipbstark
>>
> 
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list