[R] Wichmann-Hill Random Number Generator and the Birthday Problem

Shengqiao Li shli at stat.wvu.edu
Sun Aug 17 02:29:17 CEST 2008


Dear all,

Recently I am generating large random samples (10M) and any duplicated 
numbers are not desired.
We tried several RNGs in R and found Wichmann-Hill did not produce 
duplications.

The duplication problem is the interesting birthday problem. If there are 
M possible numbers, randomly draw N numbers from them,
the average number of dupilcations D = N(N-1)/2/M.

For Knuth-TAOCP and Knuth-TAOCP-2002, M=2^30, since this modulus is used. 
D = 46566.12 for N=10M samples.

For Marsaglia-Multicarry, Super-Duper and Mersene-Twister, M=2^32. D = 
11641.53 for N = 10M samples.

My testing results (see below) agree with above analysis. But for 
Wichmann-Hill, it wasn't. Wichmann-Hill's cycle is 6.9536e12 (refer to RNG 
help by ?RNG and Whichmann's correction in 1984). Thus M <= 6.9536e12. D 
>= 7.19052 when N=1e7 and D>= 179.763 for N=5e7. 
But in my tests, duplications were surprisingly not observed.

It seems that Wichmann-Hill has a much larger cycle than the one 
documented!

Anybody can solve this puzzle?

Regards,

Shengqiao Li

Department of Statistics
West Virgina Unversity

==============Testing===================

> RNGkind(kind="Knuth-TAOCP");
> RNGkind();
[1] "Knuth-TAOCP" "Inversion"
> sum(duplicated(runif(1e7)));
[1] 46216
>
> RNGkind(kind="Knuth-TAOCP-2002");
> RNGkind();
[1] "Knuth-TAOCP-2002" "Inversion"
> sum(duplicated(runif(1e7)));
[1] 46373
>

> RNGkind(kind="Marsaglia-Multicarry");
> RNGkind();
[1] "Marsaglia-Multicarry" "Inversion"
> sum(duplicated(runif(1e7)));
[1] 11671
>
>
> RNGkind(kind="Super-Duper");
> RNGkind();
[1] "Super-Duper" "Inversion"
> sum(duplicated(runif(1e7)));
[1] 11704
>
> RNGkind(kind="Mersenne-Twister");
> RNGkind();
[1] "Mersenne-Twister" "Inversion"
> sum(duplicated(runif(1e7)));
[1] 11508
>
> RNGkind(kind="Wichmann-Hill");
> RNGkind();
[1] "Wichmann-Hill" "Inversion"
> sum(duplicated(runif(1e7)));
[1] 0

> gc()
           used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  199157  5.4     646480  17.3  10149018  271.1
Vcells 4497151 34.4  108714629 829.5 169585390 1293.9

> sum(duplicated(runif(5e7)))
[1] 0


> sessionInfo()
R version 2.7.1 (2008-06-23)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
States.1252;LC_MONETARY=English_United 
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.7.1
>

==============End of Testing===================



More information about the R-help mailing list