[Rd] bias issue in sample() (PR 17494)

Kirill Müller kr|m|r+m| @end|ng |rom m@||box@org
Tue Feb 26 12:35:52 CET 2019


Ralf


I don't doubt this is expected with the current implementation, I doubt 
the implementation is desirable. Suggesting to turn this to

pbirthday(1e6, classes = 2^53)
## [1] 5.550956e-05

(which is still non-zero, but much less likely to cause confusion.)


Best regards

Kirill

On 26.02.19 10:18, Ralf Stubner wrote:
> Kirill,
>
> I think some level of collision is actually expected! R uses a 32bit MT
> that can produce 2^32 different doubles. The probability for a collision
> within a million draws is
>
>> pbirthday(1e6, classes = 2^32)
> [1] 1
>
> Greetings
> Ralf
>
>
> On 26.02.19 07:06, Kirill Müller wrote:
>> Gabe
>>
>>
>> As mentioned on Twitter, I think the following behavior should be fixed
>> as part of the upcoming changes:
>>
>> R.version.string
>> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
>> .Machine$double.digits
>> ## [1] 53
>> set.seed(123)
>> RNGkind()
>> ## [1] "Mersenne-Twister" "Inversion"        "Rejection"
>> length(table(runif(1e6)))
>> ## [1] 999863
>>
>> I don't expect any collisions when using Mersenne-Twister to generate a
>> million floating point values. I'm not sure what causes this behavior,
>> but it's documented in ?Random:
>>
>> "Do not rely on randomness of low-order bits from RNGs. Most of the
>> supplied uniform generators return 32-bit integer values that are
>> converted to doubles, so they take at most 2^32 distinct values and long
>> runs will return duplicated values (Wichmann-Hill is the exception, and
>> all give at least 30 varying bits.)"
>>
>> The "Wichman-Hill" bit is interesting:
>>
>> RNGkind("Wichmann-Hill")
>> length(table(runif(1e6)))
>> ## [1] 1000000
>> length(table(runif(1e6)))
>> ## [1] 1000000
>>
>> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
>> it would be great to see the above behavior also for Mersenne-Twister.
>> Thanks for considering.
>>
>>
>> Best regards
>>
>> Kirill
>>
>>
>> On 20.02.19 08:01, Gabriel Becker wrote:
>>> Luke,
>>>
>>> I'm happy to help with this. Its great to see this get tackled (I've
>>> cc'ed
>>> Kelli Ottoboni who helped flag this issue).
>>>
>>> I can prepare a patch for the RNGkind related stuff and the doc update.
>>>
>>> As for ???, what are your (and others') thoughts about the possibility of
>>> a) a reproducibility API which takes either an R version (or maybe
>>> alternatively a date) and sets the RNGkind to the default for that
>>> version/date, and/or b) that sessionInfo be modified to capture (and
>>> display) the RNGkind in effect.
>>>
>>> Best,
>>> ~G
>>>
>>>
>>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney using uiowa.edu>
>>> wrote:
>>>
>>>> Before the next release we really should to sort out the bias issue in
>>>> sample() reported by Ottoboni and Stark in
>>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>>>> filed aa a bug report by Duncan Murdoch at
>>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>>>
>>>> Here are two examples of bad behavior through current R-devel:
>>>>
>>>>        set.seed(123)
>>>>        m <- (2/5) * 2^32
>>>>        x <- sample(m, 1000000, replace = TRUE)
>>>>        table(x %% 2, x > m / 2)
>>>>        ##
>>>>        ##    FALSE   TRUE
>>>>        ## 0 300620 198792
>>>>        ## 1 200196 300392
>>>>
>>>>        table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>>>>        ##
>>>>        ##      0      1
>>>>        ## 429054 570946
>>>>
>>>> I committed a modification to R_unif_index to address this by
>>>> generating random bits (blocks of 16) and rejection sampling, but for
>>>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>>>> set before the first call.
>>>>
>>>> Some things still needed:
>>>>
>>>> - someone to look over the change and see if there are any issues
>>>> - adjustment of RNGkind to allowing the old behavior to be selected
>>>> - make the new behavior the default
>>>> - adjust documentation
>>>> - ???
>>>>
>>>> Unfortunately I don't have enough free cycles to do this, but I can
>>>> help if someone else can take the lead.
>>>>
>>>> There are two other places I found that might suffer from the same
>>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>>>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>>>> have done that for walker_ProbSampleReplace, but the wilcox change
>>>> might need adjusting to support the standalone math library and I
>>>> don't feel confident enough I'd get that right.
>>>>
>>>> Best,
>>>>
>>>> luke
>>>>
>>>>
>>>> -- 
>>>> Luke Tierney
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa                  Phone:             319-335-3386
>>>> Department of Statistics and        Fax:               319-335-3017
>>>>       Actuarial Science
>>>> 241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>
>>>> ______________________________________________
>>>> R-devel using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>      [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list