[Rd] portable parallel seeds project: request for critiques

Petr Savicky savicky at cs.cas.cz
Sat Feb 18 11:40:02 CET 2012


On Fri, Feb 17, 2012 at 09:33:33PM -0600, Paul Johnson wrote:
[...]
> The seed things I'm using are the 6 integer values from the L'Ecuyer.
> If you run the example script, the verbose option causes some to print
> out.  The first 3 seeds in a saved project seeds file look like:
> 
> > projSeeds[[1]]
> [[1]]
> [1]         407   376488316  1939487821  1433925148 -1040698333   579503880
> [7]  -624878918
> 
> [[2]]
> [1]         407 -1107332181   854177397  1773099324  1774170776  -266687360
> [7]   816955059
> 
> [[3]]
> [1]         407   936506900 -1924631332 -1380363206  2109234517  1585239833
> [7] -1559304513
> 
> The 407 in the first position is an integer R uses to note the type of
> stream for which the seed is intended, in this case L'Ecuyer-CMRG.

The parallel streams use the MRG32k3a generator by P. L'Ecuyer, whose original
implementation stores its internal state as double precision floating point
numbers. The vector .Random.seed consists of integers. Do you know whether
a modified implementation of MRG32k3a is used, which works with integers
internally, or whether the conversion between double and integer is done
whenever the state is stored to .Random.seed?
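
For reference, the integer layout quoted above can be inspected directly
in R (an illustrative check only):

  ## Quick look at the .Random.seed layout for the L'Ecuyer-CMRG kind:
  ## the first element encodes the RNG kind (407 in the output quoted
  ## above; later R versions prepend digits for the normal/sample kinds),
  ## and the remaining six integers hold the MRG32k3a state.
  RNGkind("L'Ecuyer-CMRG")
  set.seed(1)
  .Random.seed           # integer vector of length 7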
 
> I don't have any formal proof that a "good quality hash function"
> would truly create seeds from which independent streams will be drawn.

One of the suggested ways to generate, say, 2^16 cryptographically secure
random numbers is to use a counter producing the sequence 1, 2, 3, ..., 2^16
and to transform these values by a hash function. An example is the Fortuna
generator

  http://en.wikipedia.org/wiki/Fortuna_(PRNG)

which is also available on CRAN as "randaes". The length of the sequence
is limited, since the sequence contains no collisions; if the sequence were
too long, this would make it possible to distinguish it from truly random
numbers. After generating 2^16 numbers, the seed is recomputed and another
2^16 numbers are generated.
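
As an illustration of the counter-plus-hash idea (only a sketch; it uses
the CRAN package digest for the hash, which is an arbitrary choice here,
and the function name counter_seed is made up for the example):

  ## Sketch: derive a block of integer seed material from a counter value
  ## by hashing it (requires the CRAN package 'digest').
  library(digest)

  counter_seed <- function(i, n_ints = 6) {
    h <- digest(i, algo = "sha1")                 # 40 hex characters
    start <- seq(1, by = 6, length.out = n_ints)  # 6 hex digits per integer
    chunks <- substring(h, start, start + 5)
    strtoi(chunks, base = 16L)                    # well below .Machine$integer.max
  }

  counter_seed(1)   # six integers derived from counter value 1
  counter_seed(2)   # six different integers from counter value 2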

Such a generator is a very good one. It is not used for simulations, since it
is slower (say by a factor of 5) than linear generators such as the Mersenne
Twister, WELL, or the MRG family. However, if it is used only for
initialization, then the speed is less important.

> There is, however, the proof in the L'Ecuyer paper that one can take
> the long stream and divide it into sections.  That's the approach I'm
> taking here. It's the same approach the parallel package in R
> follows, as do parallel frameworks like snow.

According to

  http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf

the sectioning of the stream into substreams is done by a jump-ahead
algorithm. Each new seed lies far enough ahead of the previous seed in the
sequence that the subsequence generated from the previous seed is guaranteed
to be shorter than the jump, so the subsequences do not overlap. However, the
new seed is computable as a function of the previous seed. If we use this
strategy to produce a matrix of seeds

  s_{1,1}, s_{1,2}, ...
  s_{2,1}, s_{2,2}, ...
  s_{3,1}, s_{3,2}, ...

then each s_{i,j} may be computed from s_{1,1} and the indices i, j. In this
case, it is sufficient to store s_{1,1}: if we know the indices i, j for each
run, then s_{i,j} can be recomputed by an efficient algorithm.
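
A small sketch of this seed-matrix idea using the parallel package in
base R (R >= 2.14), whose functions nextRNGStream() and nextRNGSubStream()
perform the jump ahead on an L'Ecuyer-CMRG seed; the helper seed_matrix()
is only illustrative:

  ## Build a matrix of seeds, all reproducible from the single seed s_{1,1}:
  ## rows are streams, columns are substreams.
  library(parallel)

  RNGkind("L'Ecuyer-CMRG")
  set.seed(123)
  s11 <- .Random.seed                  # this plays the role of s_{1,1}

  seed_matrix <- function(s11, n_streams, n_substreams) {
    seeds <- vector("list", n_streams)
    s_row <- s11
    for (i in seq_len(n_streams)) {
      seeds[[i]] <- vector("list", n_substreams)
      s <- s_row
      for (j in seq_len(n_substreams)) {
        seeds[[i]][[j]] <- s
        s <- nextRNGSubStream(s)       # jump ahead within the stream
      }
      s_row <- nextRNGStream(s_row)    # jump ahead to the next stream
    }
    seeds
  }

  seeds <- seed_matrix(s11, n_streams = 3, n_substreams = 2)
  seeds[[2]][[1]]                      # s_{2,1}, recomputed from s_{1,1} alone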

> The different thing in my approach is that I'm saving one row of seeds
> per simulation "run".  So each run can be replicated exactly.
> 
> I hope.

Saving .Random.seed should be a safe strategy.
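
For example, restoring a saved copy of .Random.seed reproduces a run
exactly (a minimal illustration):

  ## Save and later restore .Random.seed to replicate one run exactly.
  RNGkind("L'Ecuyer-CMRG")
  set.seed(42)

  saved_seed <- .Random.seed   # record the state before the run
  x <- runif(5)                # the "run"

  .Random.seed <- saved_seed   # restore the state (e.g. after reloading
                               # the saved seed from disk)
  y <- runif(5)
  identical(x, y)              # TRUE: the run is replicated exactly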

Petr Savicky.


