[Rd] Bias in R's random integers?

Duncan Murdoch murdoch.duncan at gmail.com
Thu Sep 20 00:08:28 CEST 2018


On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
> 
> It doesn't seem too hard to come up with plausible ways in which this 
> could give bad results. Suppose I sample rows from a large dataset, 
> maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g. 
> odd rows are males, even rows are females. Oops! Very non-representative 
> sample, bootstrap p values are garbage.

That would only happen if your dataset were exactly 1717986918 elements 
in size.  (And in fact, it will be less extreme than I posted: I had x 
set to 1717986918.4, as described in another thread.  If you use an 
integer value you need a different pattern; add or subtract an element 
or two and the pattern needed to see a problem changes drastically.)
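
To see how specific the pattern is, here is a quick check one can run 
(under the sample() behaviour discussed in this thread; the exact 
proportions will of course vary from run to run):

# Even/odd imbalance for the non-integer x above versus nearby integer sizes.
# Only x = 0.4 * 2^32 = 1717986918.4 should show a clear 60/40 split; for the
# integer sizes a different event would be needed to see anything.
set.seed(1)
for (m in c(0.4 * 2^32, 1717986917, 1717986918, 1717986919)) {
  x <- sample(m, 1e6, replace = TRUE)
  cat(sprintf("m = %.1f  proportion odd = %.4f\n", m, mean(x %% 2 == 1)))
}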

But if you're sampling from a dataset of that exact size, then you 
should worry about this bug. Don't use sample().  Use the algorithm that 
Carl described.
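
For concreteness, here is a minimal sketch of that kind of rejection 
sampler in R (the bitmask-and-reject idea used by Go and Python and 
described in the arXiv paper cited below; the function names are made up, 
and rbits() just manufactures fair bits from runif(), whereas a real 
implementation would take bits directly from the generator):

rbits <- function(k) as.integer(runif(k) < 0.5)   # k random bits, assumed fair

unbiased_index <- function(m) {
  k <- ceiling(log2(m))                       # enough bits to cover 0 .. m - 1
  repeat {
    v <- sum(rbits(k) * 2^(seq_len(k) - 1))   # uniform on 0 .. 2^k - 1
    if (v < m) return(v + 1)                  # accept: uniform on 1 .. m
    # otherwise reject and redraw; the expected number of tries is below 2
  }
}

# e.g. table(replicate(1e4, unbiased_index(6))) should be close to uniform

Because every accepted bit pattern maps to exactly one value, no value can 
pick up an extra sliver of probability, which is something 
floor(dn * unif_rand()) cannot guarantee.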

Duncan Murdoch

> 
> David
> 
> On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> 
>     On 19/09/2018 3:52 PM, Philip B. Stark wrote:
>      > Hi Duncan--
>      >
>      >
> 
>     That is a mathematically true statement, but I suspect it is not very
>     relevant.  Pseudo-random number generators always have test functions
>     whose sample averages are quite different from the expectation under
>     the
>     true distribution.  Remember Von Neumann's "state of sin" quote.  The
>     bug in sample() just means it is easier to find such a function than it
>     would otherwise be.
> 
>     The practical question is whether such a function is likely to arise in
>     practice or not.
> 
>      > Whether those correspond to commonly used statistics or not, I have
>      > no idea.
> 
>     I am pretty confident that this bug rarely matters.
> 
>      > Regarding backwards compatibility: as a user, I'd rather the default
>      > sample() do the best possible thing, and take an extra step to use
>      > something like sample(..., legacy=TRUE) if I want to reproduce
>     old results.
> 
>     I suspect there's a good chance the bug I discovered today (non-integer
>     x values not being truncated) will be declared to be a feature, and the
>     documentation will be changed.  Then the rejection sampling approach
>     would need to be quite a bit more complicated.
> 
>     I think a documentation warning about the accuracy of sampling
>     probabilities would also be a sufficient fix here, and would be quite a
>     bit less trouble than changing the default sample().  But as I said in
>     my original post, a contribution of a function without this bug
>     would be
>     a nice addition.
> 
>     Duncan Murdoch
> 
>      >
>      > Regards,
>      > Philip
>      >
>      > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
>      > <murdoch.duncan at gmail.com> wrote:
>      >
>      >     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>      >      > No, the 2nd call only happens when m > 2**31. Here's the code:
>      >
>      >     Yes, you're right. Sorry!
>      >
>      >     So the ratio really does come close to 2.  However, the
>     difference in
>      >     probabilities between outcomes is still at most 2^-32 when m
>     is less
>      >     than that cutoff.  That's not feasible to detect; the only
>     detectable
>      >     difference would happen if some event was constructed to hold an
>      >     abundance of outcomes with especially low (or especially high)
>      >     probability.
>      >
>      >     As I said in my original post, it's probably not hard to
>     construct such
>      >     a thing, but as I've said more recently, it probably wouldn't
>     happen by
>      >     chance.  Here's one attempt to do it:
>      >
>      >     Call the values from unif_rand() "the unif_rand() outcomes".  Call the
>      >     values from sample() the sample outcomes.
>      >
>      >     It would be easiest to see the error if half of the sample()
>     outcomes
>      >     used two unif_rand() outcomes, and half used just one.  That
>     would mean
>      >     m should be (2/3) * 2^32, but that's too big and would
>     trigger the
>      >     other
>      >     version.
>      >
>      >     So how about half use 2 unif_rands(), and half use 3?  That
>     means m =
>      >     (2/5) * 2^32 = 1717986918.  A good guess is that sample()
>     outcomes
>      >     would
>      >     alternate between the two possibilities, so our event could
>     be even
>      >     versus odd outcomes.
>      >
>      >     Let's try it:
>      >
>      >       > m <- (2/5)*2^32
>      >       > m > 2^31
>      >     [1] FALSE
>      >       > x <- sample(m, 1000000, replace = TRUE)
>      >       > table(x %% 2)
>      >
>      >            0      1
>      >     399850 600150
>      >
>      >     Since m is an even number, the true proportions of evens and odds
>      >     should
>      >     be exactly 0.5.  That's some pretty strong evidence of the
>     bug in the
>      >     generator.  (Note that the ratio of the observed
>     probabilities is about
>      >     1.5, so I may not be the first person to have done this.)
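
For what it is worth, a formal test of that split against the nominal 
50/50 leaves no room for doubt:

# two-sided test of 600150 odd values out of 1000000 against p = 0.5
prop.test(600150, 1e6, p = 0.5)   # the p-value is essentially zero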
>      >
>      >     I'm still not convinced that there has ever been a simulation
>     run with
>      >     detectable bias compared to Monte Carlo error unless it (like
>     this one)
>      >     was designed specifically to show the problem.
>      >
>      >     Duncan Murdoch
>      >
>      >      >
>      >      > (RNG.c, lines 793ff)
>      >      >
>      >      > double R_unif_index(double dn)
>      >      > {
>      >      >     double cut = INT_MAX;
>      >      >
>      >      >     switch(RNG_kind) {
>      >      >     case KNUTH_TAOCP:
>      >      >     case USER_UNIF:
>      >      >     case KNUTH_TAOCP2:
>      >      >         cut = 33554431.0; /* 2^25 - 1 */
>      >      >         break;
>      >      >     default:
>      >      >         break;
>      >      >     }
>      >      >
>      >      >     double u = dn > cut ? ru() : unif_rand();
>      >      >     return floor(dn * u);
>      >      > }
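
To make the effect of that final floor(dn * u) concrete, here is a toy 
version with a generator that can only produce K = 2^16 equally spaced 
values instead of roughly 2^32 (the numbers mirror the (2/5) * 2^32 
example above; this is only an illustration, not R's actual code):

K  <- 2^16                        # toy grid standing in for the ~2^32 values of unif_rand()
dn <- 0.4 * K                     # same 2/5 fraction of the grid as 1717986918.4 = 0.4 * 2^32
u  <- (0:(K - 1)) / K             # every value the toy generator can emit, each equally likely
outcome <- floor(dn * u) + 1      # the floor(dn * u) step, shifted to 1-based like sample()
counts  <- tabulate(outcome)
table(head(counts, -1))           # apart from the boundary value, each outcome is covered by
                                  #   either 2 or 3 grid points, a selection ratio of 3/2 = 1.5
prop.table(table(outcome %% 2))   # the extra grid points all land on odd outcomes, giving the
                                  #   roughly 60/40 split seen in the simulation above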
>      >      >
>      >      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>      >      > <murdoch.duncan at gmail.com> wrote:
>      >      >
>      >      >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>      >      >      > The 53 bits only encode at most 2^{32} possible values,
>      >     because the
>      >      >      > source of the float is the output of a 32-bit PRNG (the
>      >     obsolete
>      >      >     version
>      >      >      > of MT). 53 bits isn't the relevant number here.
>      >      >
>      >      >     No, two calls to unif_rand() are used.  There are two
>     32 bit
>      >     values,
>      >      >     but
>      >      >     some of the bits are thrown away.
>      >      >
>      >      >     Duncan Murdoch
>      >      >
>      >      >      >
>      >      >      > The selection ratios can get close to 2. Computer
>     scientists
>      >      >     don't do it
>      >      >      > the way R does, for a reason.
>      >      >      >
>      >      >      > Regards,
>      >      >      > Philip
>      >      >      >
>      >      >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>      >      >      > <murdoch.duncan at gmail.com> wrote:
>      >      >      >
>      >      >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>      >      >      >      > On Wed 19 Sept 2018 at 14:43, Duncan Murdoch
>      >      >      >      > (<murdoch.duncan at gmail.com>) wrote:
>      >      >      >      >>
>      >      >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>      >      >      >      >>> Dear list,
>      >      >      >      >>>
>      >      >      >      >>> It looks to me that R samples random integers
>      >     using an
>      >      >      >     intuitive but biased
>      >      >      >      >>> algorithm by going from a random number on
>     [0,1) from
>      >      >     the PRNG
>      >      >      >     to a random
>      >      >      >      >>> integer, e.g.
>      >      >      >      >>>
>      >      >      >
>      > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>      >      >      >      >>>
>      >      >      >      >>> Many other languages use various rejection
>     sampling
>      >      >     approaches
>      >      >      >     which
>      >      >      >      >>> provide an unbiased method for sampling,
>     such as
>      >     in Go,
>      >      >     python,
>      >      >      >     and others
>      >      >      >      >>> described here:
>     https://arxiv.org/abs/1805.10941 (I
>      >      >     believe the
>      >      >      >     biased
>      >      >      >      >>> algorithm currently used in R is also
>     described
>      >     there).  I'm
>      >      >      >     not an expert
>      >      >      >      >>> in this area, but does it make sense for
>     the R to
>      >     adopt
>      >      >     one of
>      >      >      >     the unbiased
>      >      >      >      >>> random sample algorithms outlined there
>     and used
>      >     in other
>      >      >      >     languages?  Would
>      >      >      >      >>> a patch providing such an algorithm be
>     welcome? What
>      >      >     concerns
>      >      >      >     would need to
>      >      >      >      >>> be addressed first?
>      >      >      >      >>>
>      >      >      >      >>> I believe this issue was also raised by Kellie &
>      >      >      >      >>> Philip in
>      >      >      >      >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
>      >      >      >      >>> more recently in
>      >      >      >      >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
>      >      >      >      >>> pointing to the python implementation for comparison:
>      >      >      >      >>>
>      >      >      >      >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>      >      >      >      >>
>      >      >      >      >> I think the analyses are correct, but I
>     doubt if a
>      >     change
>      >      >     to the
>      >      >      >     default
>      >      >      >      >> is likely to be accepted as it would make
>     it more
>      >      >     difficult to
>      >      >      >     reproduce
>      >      >      >      >> older results.
>      >      >      >      >>
>      >      >      >      >> On the other hand, a contribution of a new
>      >     function like
>      >      >      >     sample() but
>      >      >      >      >> not suffering from the bias would be good.  The
>      >     normal way to
>      >      >      >     make such
>      >      >      >      >> a contribution is in a user contributed
>     package.
>      >      >      >      >>
>      >      >      >      >> By the way, R code illustrating the bias is
>      >     probably not very
>      >      >      >     hard to
>      >      >      >      >> put together.  I believe the bias manifests
>     itself in
>      >      >     sample()
>      >      >      >     producing
>      >      >      >      >> values with two different probabilities
>     (instead
>      >     of all equal
>      >      >      >      >> probabilities).  Those may differ by as much as
>      >     one part in
>      >      >      >     2^32.  It's
>      >      >      >      >
>      >      >      >      > According to Kellie and Philip, in the
>     attachment
>      >     of the
>      >      >     thread
>      >      >      >      > referenced by Carl, "The maximum ratio of
>     selection
>      >      >     probabilities can
>      >      >      >      > get as large as 1.5 if n is just below 2^31".
>      >      >      >
>      >      >      >     Sorry, I didn't write very well.  I meant to
>     say that the
>      >      >     difference in
>      >      >      >     probabilities would be 2^-32, not that the ratio of
>      >      >     probabilities would
>      >      >      >     be 1 + 2^-32.
>      >      >      >
>      >      >      >     By the way, I don't see the statement giving
>     the ratio as
>      >      >     1.5, but
>      >      >      >     maybe
>      >      >      >     I was looking in the wrong place.  In Theorem 1
>     of the
>      >     paper
>      >      >     I was
>      >      >      >     looking in the ratio was "1 + m 2^{-w + 1}". 
>     In that
>      >     formula
>      >      >     m is your
>      >      >      >     n.  If it is near 2^31, R uses w = 57 random
>     bits, so
>      >     the ratio
>      >      >      >     would be
>      >      >      >     very, very small (one part in 2^25).
>      >      >      >
>      >      >      >     The worst case for R would happen when m  is
>     just below
>      >      >     2^25, where w
>      >      >      >     is at least 31 for the default generators.  In that
>      >     case the
>      >      >     ratio
>      >      >      >     could
>      >      >      >     be about 1.03.
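
Plugging numbers into that formula makes both cases concrete (a 
back-of-the-envelope check, using the w values mentioned above):

# ratio bound 1 + m * 2^(-w + 1) from the Theorem 1 quoted above
1 + 2^31 * 2^(-57 + 1)   # m near 2^31, w = 57: 1 + 2^-25, i.e. one part in 2^25
1 + 2^25 * 2^(-31 + 1)   # m near 2^25, w = 31: 1.03125, the "about 1.03" above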
>      >      >      >
>      >      >      >     Duncan Murdoch
>      >      >      >
>      >      >      >
>      >      >      >
>      >      >      > --
>      >      >      > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>      >      >      > Professor, Department of Statistics | University of California
>      >      >      > Berkeley, CA 94720-3860 | 510-394-5077 |
>      >      >      > statistics.berkeley.edu/~stark | @philipbstark
>      >      >      >
>      >      >
>      >      >
>      >      >
>      >      > --
>      >      > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>      >      > Professor, Department of Statistics | University of California
>      >      > Berkeley, CA 94720-3860 | 510-394-5077 |
>      >      > statistics.berkeley.edu/~stark | @philipbstark
>      >      >
>      >
>      >
>      >
>      > --
>      > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>      > Professor, Department of Statistics | University of California
>      > Berkeley, CA 94720-3860 | 510-394-5077 |
>      > statistics.berkeley.edu/~stark | @philipbstark
>      >
> 
>     ______________________________________________
>     R-devel at r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> -- 
> Sent from Gmail Mobile


