[Rd] bias issue in sample() (PR 17494)

Ralf Stubner r@||@@tubner @end|ng |rom d@q@n@@com
Tue Feb 26 10:18:41 CET 2019


Kirill,

I think some level of collision is actually expected! R uses a 32bit MT
that can produce 2^32 different doubles. The probability for a collision
within a million draws is

> pbirthday(1e6, classes = 2^32)
[1] 1

Greetings
Ralf


On 26.02.19 07:06, Kirill Müller wrote:
> Gabe
> 
> 
> As mentioned on Twitter, I think the following behavior should be fixed
> as part of the upcoming changes:
> 
> R.version.string
> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
> .Machine$double.digits
> ## [1] 53
> set.seed(123)
> RNGkind()
> ## [1] "Mersenne-Twister" "Inversion"        "Rejection"
> length(table(runif(1e6)))
> ## [1] 999863
> 
> I don't expect any collisions when using Mersenne-Twister to generate a
> million floating point values. I'm not sure what causes this behavior,
> but it's documented in ?Random:
> 
> "Do not rely on randomness of low-order bits from RNGs. Most of the
> supplied uniform generators return 32-bit integer values that are
> converted to doubles, so they take at most 2^32 distinct values and long
> runs will return duplicated values (Wichmann-Hill is the exception, and
> all give at least 30 varying bits.)"
> 
> The "Wichman-Hill" bit is interesting:
> 
> RNGkind("Wichmann-Hill")
> length(table(runif(1e6)))
> ## [1] 1000000
> length(table(runif(1e6)))
> ## [1] 1000000
> 
> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
> it would be great to see the above behavior also for Mersenne-Twister.
> Thanks for considering.
> 
> 
> Best regards
> 
> Kirill
> 
> 
> On 20.02.19 08:01, Gabriel Becker wrote:
>> Luke,
>>
>> I'm happy to help with this. Its great to see this get tackled (I've
>> cc'ed
>> Kelli Ottoboni who helped flag this issue).
>>
>> I can prepare a patch for the RNGkind related stuff and the doc update.
>>
>> As for ???, what are your (and others') thoughts about the possibility of
>> a) a reproducibility API which takes either an R version (or maybe
>> alternatively a date) and sets the RNGkind to the default for that
>> version/date, and/or b) that sessionInfo be modified to capture (and
>> display) the RNGkind in effect.
>>
>> Best,
>> ~G
>>
>>
>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney using uiowa.edu>
>> wrote:
>>
>>> Before the next release we really should to sort out the bias issue in
>>> sample() reported by Ottoboni and Stark in
>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>>> filed aa a bug report by Duncan Murdoch at
>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>>
>>> Here are two examples of bad behavior through current R-devel:
>>>
>>>       set.seed(123)
>>>       m <- (2/5) * 2^32
>>>       x <- sample(m, 1000000, replace = TRUE)
>>>       table(x %% 2, x > m / 2)
>>>       ##
>>>       ##    FALSE   TRUE
>>>       ## 0 300620 198792
>>>       ## 1 200196 300392
>>>
>>>       table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>>>       ##
>>>       ##      0      1
>>>       ## 429054 570946
>>>
>>> I committed a modification to R_unif_index to address this by
>>> generating random bits (blocks of 16) and rejection sampling, but for
>>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>>> set before the first call.
>>>
>>> Some things still needed:
>>>
>>> - someone to look over the change and see if there are any issues
>>> - adjustment of RNGkind to allowing the old behavior to be selected
>>> - make the new behavior the default
>>> - adjust documentation
>>> - ???
>>>
>>> Unfortunately I don't have enough free cycles to do this, but I can
>>> help if someone else can take the lead.
>>>
>>> There are two other places I found that might suffer from the same
>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>>> have done that for walker_ProbSampleReplace, but the wilcox change
>>> might need adjusting to support the standalone math library and I
>>> don't feel confident enough I'd get that right.
>>>
>>> Best,
>>>
>>> luke
>>>
>>>
>>> -- 
>>> Luke Tierney
>>> Ralph E. Wareham Professor of Mathematical Sciences
>>> University of Iowa                  Phone:             319-335-3386
>>> Department of Statistics and        Fax:               319-335-3017
>>>      Actuarial Science
>>> 241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: ralf.stubner using daqana.com

Sitz: Potsdam
Register: AG Potsdam HRB 27966
Ust.-IdNr.: DE300072622
Geschäftsführer: Dr.-Ing. Stefan Knirsch, Prof. Dr. Dr. Karl-Kuno Kunze


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20190226/2b4562c2/attachment.sig>


More information about the R-devel mailing list