[Rd] predictable bit patterns in runif(n) shortly after set.seed
Petr Savicky
savicky at cs.cas.cz
Wed Oct 17 18:11:16 CEST 2007
Mersenne Twister generator is known to be sensitive to the
algorithm used to generate its initial state. The initialization
used in R generates the initial state in a way, which leaves
linear dependencies mod 2 among the bits in the initial state.
Since Mersenne Twister performs only operations, which are
linear mod 2, these dependencies propagate to the output sequence.
An easy to see consequence of this may be demonstrated by
the following script:
pattern <- function(m=1400)
{
x <- runif(m)
y <- matrix(nrow=m,ncol=32)
for (j in 1:32) {
y[,j] <- floor(2*x)
x <- 2*x - y[,j]
}
u <- rep(0,times=32)
u[c(3,7,10,14,21,25,32)] <- 1
c(y %*% u) %% 2
}
RNGkind("default") # or set.seed() with any seed
z <- pattern()
abs(diff(z,lag=2)) # sequence with long constant subsequences
It should be pointed out that it is indeed a consequence of the
initialization. If e.g. runif(10000) is run after RNGkind/set.seed,
then the effect disappears.
Note that each row in matrix y used in function pattern() contains
the bit representation of one of the numbers from runif(m).
Different elements of z are derived from different rows in y and,
hence, from different elements of runif(m). Consequently, they should
mimic an i.i.d. sequence. However, abs(diff(z,lag=2)) allows to
reject such an assumption easily.
The pattern is even better visible graphically, for example using
for (i in 1:20) {
set.seed(i)
z <- pattern()
z <- abs(diff(z,lag=2))
if (i == 1) {
plot(cumsum(2*z-1),type="l")
} else {
lines(cumsum(2*z-1))
}
}
The resulting curve is almost the same for all the seeds.
I have a working patch, which solves this problem by adding
a new generator called Mersenne-Twister-52, which is the
standard Mersenne Twister with the following modifications:
- It uses MRG32k5a by P.L'Ecuyer for generating the initial state
(This generator works modulo odd primes and so does not generate
dependencies of the kind to which Mersenne Twister is sensitive.)
- Combines 26 bits of two consequtive numbers into a single number
with 52 random bits (this explains its name) and adds a constant
shift 2^-53 to guarantee that the result is always in (0,1).
Combining the two changes together allows to keep the current Mersenne
Twister implementation intact for backward compatibility and provides
more reasons to add a new name than just a different seeding.
In my opinion, there may be applications, which can benefit from
more then 32 random bits in the numbers from runif(n).
I would be pleased to send the patch to R-devel, if the proposed
solution is of the sort, which could be considered.
Petr Savicky.
More information about the R-devel
mailing list