[Rd] Bias in R's random integers?

David Hugh-Jones d@vidhughjone@ @ending from gm@il@com
Wed Sep 19 23:57:10 CEST 2018


It doesn't seem too hard to come up with plausible ways in which this could
give bad results. Suppose I sample rows from a large dataset, maybe for
bootstrapping. Suppose the rows are non-randomly ordered, e.g. odd rows are
males, even rows are females. Oops! Very non-representative sample,
bootstrap p values are garbage.

David

On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan using gmail.com>
wrote:

> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> > Hi Duncan--
> >
> >
>
> That is a mathematically true statement, but I suspect it is not very
> relevant.  Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.  Remember Von Neumann's "state of sin" quote.  The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
>  > Whether those correspond to commonly used statistics or not, I have no
>  > idea.
>
> I am pretty confident that this bug rarely matters.
>
> > Regarding backwards compatibility: as a user, I'd rather the default
> > sample() do the best possible thing, and take an extra step to use
> > something like sample(..., legacy=TRUE) if I want to reproduce old
> results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.  Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().  But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
>
> Duncan Murdoch
>
> >
> > Regards,
> > Philip
> >
> > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan using gmail.com
> > <mailto:murdoch.duncan using gmail.com>> wrote:
> >
> >     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> >      > No, the 2nd call only happens when m > 2**31. Here's the code:
> >
> >     Yes, you're right. Sorry!
> >
> >     So the ratio really does come close to 2.  However, the difference in
> >     probabilities between outcomes is still at most 2^-32 when m is less
> >     than that cutoff.  That's not feasible to detect; the only detectable
> >     difference would happen if some event was constructed to hold an
> >     abundance of outcomes with especially low (or especially high)
> >     probability.
> >
> >     As I said in my original post, it's probably not hard to construct
> such
> >     a thing, but as I've said more recently, it probably wouldn't happen
> by
> >     chance.  Here's one attempt to do it:
> >
> >     Call the values from unif_rand() "the unif_rand() outcomes".  Call
> the
> >     values from sample() the sample outcomes.
> >
> >     It would be easiest to see the error if half of the sample() outcomes
> >     used two unif_rand() outcomes, and half used just one.  That would
> mean
> >     m should be (2/3) * 2^32, but that's too big and would trigger the
> >     other
> >     version.
> >
> >     So how about half use 2 unif_rands(), and half use 3?  That means m =
> >     (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
> >     would
> >     alternate between the two possibilities, so our event could be even
> >     versus odd outcomes.
> >
> >     Let's try it:
> >
> >       > m <- (2/5)*2^32
> >       > m > 2^31
> >     [1] FALSE
> >       > x <- sample(m, 1000000, replace = TRUE)
> >       > table(x %% 2)
> >
> >            0      1
> >     399850 600150
> >
> >     Since m is an even number, the true proportions of evens and odds
> >     should
> >     be exactly 0.5.  That's some pretty strong evidence of the bug in the
> >     generator.  (Note that the ratio of the observed probabilities is
> about
> >     1.5, so I may not be the first person to have done this.)
> >
> >     I'm still not convinced that there has ever been a simulation run
> with
> >     detectable bias compared to Monte Carlo error unless it (like this
> one)
> >     was designed specifically to show the problem.
> >
> >     Duncan Murdoch
> >
> >      >
> >      > (RNG.c, lines 793ff)
> >      >
> >      > double R_unif_index(double dn)
> >      > {
> >      >      double cut = INT_MAX;
> >      >
> >      >      switch(RNG_kind) {
> >      >      case KNUTH_TAOCP:
> >      >      case USER_UNIF:
> >      >      case KNUTH_TAOCP2:
> >      > cut = 33554431.0; /* 2^25 - 1 */
> >      > break;
> >      >      default:
> >      > break;
> >      >     }
> >      >
> >      >      double u = dn > cut ? ru() : unif_rand();
> >      >      return floor(dn * u);
> >      > }
> >      >
> >      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> >     <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>
> >      > <mailto:murdoch.duncan using gmail.com
> >     <mailto:murdoch.duncan using gmail.com>>> wrote:
> >      >
> >      >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> >      >      > The 53 bits only encode at most 2^{32} possible values,
> >     because the
> >      >      > source of the float is the output of a 32-bit PRNG (the
> >     obsolete
> >      >     version
> >      >      > of MT). 53 bits isn't the relevant number here.
> >      >
> >      >     No, two calls to unif_rand() are used.  There are two 32 bit
> >     values,
> >      >     but
> >      >     some of the bits are thrown away.
> >      >
> >      >     Duncan Murdoch
> >      >
> >      >      >
> >      >      > The selection ratios can get close to 2. Computer
> scientists
> >      >     don't do it
> >      >      > the way R does, for a reason.
> >      >      >
> >      >      > Regards,
> >      >      > Philip
> >      >      >
> >      >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> >      >     <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>
> >     <mailto:murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>>
> >      >      > <mailto:murdoch.duncan using gmail.com
> >     <mailto:murdoch.duncan using gmail.com>
> >      >     <mailto:murdoch.duncan using gmail.com
> >     <mailto:murdoch.duncan using gmail.com>>>> wrote:
> >      >      >
> >      >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> >      >      >      > El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
> >      >      >      > (<murdoch.duncan using gmail.com
> >     <mailto:murdoch.duncan using gmail.com>
> >      >     <mailto:murdoch.duncan using gmail.com
> >     <mailto:murdoch.duncan using gmail.com>> <mailto:murdoch.duncan using gmail.com
> >     <mailto:murdoch.duncan using gmail.com>
> >      >     <mailto:murdoch.duncan using gmail.com
> >     <mailto:murdoch.duncan using gmail.com>>>>)
> >      >      >     escribió:
> >      >      >      >>
> >      >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> >      >      >      >>> Dear list,
> >      >      >      >>>
> >      >      >      >>> It looks to me that R samples random integers
> >     using an
> >      >      >     intuitive but biased
> >      >      >      >>> algorithm by going from a random number on [0,1)
> from
> >      >     the PRNG
> >      >      >     to a random
> >      >      >      >>> integer, e.g.
> >      >      >      >>>
> >      >      >
> >
> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >      >      >      >>>
> >      >      >      >>> Many other languages use various rejection
> sampling
> >      >     approaches
> >      >      >     which
> >      >      >      >>> provide an unbiased method for sampling, such as
> >     in Go,
> >      >     python,
> >      >      >     and others
> >      >      >      >>> described here: https://arxiv.org/abs/1805.10941
> (I
> >      >     believe the
> >      >      >     biased
> >      >      >      >>> algorithm currently used in R is also described
> >     there).  I'm
> >      >      >     not an expert
> >      >      >      >>> in this area, but does it make sense for the R to
> >     adopt
> >      >     one of
> >      >      >     the unbiased
> >      >      >      >>> random sample algorithms outlined there and used
> >     in other
> >      >      >     languages?  Would
> >      >      >      >>> a patch providing such an algorithm be welcome?
> What
> >      >     concerns
> >      >      >     would need to
> >      >      >      >>> be addressed first?
> >      >      >      >>>
> >      >      >      >>> I believe this issue was also raised by Killie &
> >     Philip in
> >      >      >      >>>
> >      > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> >      >      >     more
> >      >      >      >>> recently in
> >      >      >      >>>
> >      >      >
> >      >
> >     https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
> >     <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> >      >
> >       <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> >      >      >
> >      >
> >       <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
> >      >      >      >>> pointing to the python implementation for
> comparison:
> >      >      >      >>>
> >      >      >
> >      >
> >
> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> >      >      >      >>
> >      >      >      >> I think the analyses are correct, but I doubt if a
> >     change
> >      >     to the
> >      >      >     default
> >      >      >      >> is likely to be accepted as it would make it more
> >      >     difficult to
> >      >      >     reproduce
> >      >      >      >> older results.
> >      >      >      >>
> >      >      >      >> On the other hand, a contribution of a new
> >     function like
> >      >      >     sample() but
> >      >      >      >> not suffering from the bias would be good.  The
> >     normal way to
> >      >      >     make such
> >      >      >      >> a contribution is in a user contributed package.
> >      >      >      >>
> >      >      >      >> By the way, R code illustrating the bias is
> >     probably not very
> >      >      >     hard to
> >      >      >      >> put together.  I believe the bias manifests itself
> in
> >      >     sample()
> >      >      >     producing
> >      >      >      >> values with two different probabilities (instead
> >     of all equal
> >      >      >      >> probabilities).  Those may differ by as much as
> >     one part in
> >      >      >     2^32.  It's
> >      >      >      >
> >      >      >      > According to Kellie and Philip, in the attachment
> >     of the
> >      >     thread
> >      >      >      > referenced by Carl, "The maximum ratio of selection
> >      >     probabilities can
> >      >      >      > get as large as 1.5 if n is just below 2^31".
> >      >      >
> >      >      >     Sorry, I didn't write very well.  I meant to say that
> the
> >      >     difference in
> >      >      >     probabilities would be 2^-32, not that the ratio of
> >      >     probabilities would
> >      >      >     be 1 + 2^-32.
> >      >      >
> >      >      >     By the way, I don't see the statement giving the ratio
> as
> >      >     1.5, but
> >      >      >     maybe
> >      >      >     I was looking in the wrong place.  In Theorem 1 of the
> >     paper
> >      >     I was
> >      >      >     looking in the ratio was "1 + m 2^{-w + 1}".  In that
> >     formula
> >      >     m is your
> >      >      >     n.  If it is near 2^31, R uses w = 57 random bits, so
> >     the ratio
> >      >      >     would be
> >      >      >     very, very small (one part in 2^25).
> >      >      >
> >      >      >     The worst case for R would happen when m  is just below
> >      >     2^25, where w
> >      >      >     is at least 31 for the default generators.  In that
> >     case the
> >      >     ratio
> >      >      >     could
> >      >      >     be about 1.03.
> >      >      >
> >      >      >     Duncan Murdoch
> >      >      >
> >      >      >
> >      >      >
> >      >      > --
> >      >      > Philip B. Stark | Associate Dean, Mathematical and Physical
> >      >     Sciences |
> >      >      > Professor,  Department of Statistics |
> >      >      > University of California
> >      >      > Berkeley, CA 94720-3860 | 510-394-5077 |
> >      > statistics.berkeley.edu/~stark
> >     <http://statistics.berkeley.edu/%7Estark>
> >      >     <http://statistics.berkeley.edu/%7Estark>
> >      >      > <http://statistics.berkeley.edu/%7Estark> |
> >      >      > @philipbstark
> >      >      >
> >      >
> >      >
> >      >
> >      > --
> >      > Philip B. Stark | Associate Dean, Mathematical and Physical
> >     Sciences |
> >      > Professor,  Department of Statistics |
> >      > University of California
> >      > Berkeley, CA 94720-3860 | 510-394-5077 |
> >     statistics.berkeley.edu/~stark
> >     <http://statistics.berkeley.edu/%7Estark>
> >      > <http://statistics.berkeley.edu/%7Estark> |
> >      > @philipbstark
> >      >
> >
> >
> >
> > --
> > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > Professor,  Department of Statistics |
> > University of California
> > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark
> > <http://statistics.berkeley.edu/%7Estark> |
> > @philipbstark
> >
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
-- 
Sent from Gmail Mobile

	[[alternative HTML version deleted]]



More information about the R-devel mailing list