[Rd] portable parallel seeds project: request for critiques

Sat Feb 18 00:06:33 CET 2012

On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
> I've got another edition of my simulation replication framework.  I'm
> attaching 2 R files and pasting in the readme.
> 
> I would especially like to know if I'm doing anything that breaks
> .Random.seed or other things that R's parallel uses in the
> environment.
> 
> In case you don't want to wrestle with attachments, the same files are
> online in our SVN
> 
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/
> 
> 
> ## Paul E. Johnson CRMDA <pauljohn at ku.edu>
> ## Portable Parallel Seeds Project.
> ## 2012-02-18
> 
> Portable Parallel Seeds Project
> 
> This is how I'm going to recommend we work with random number seeds in
> simulations. It enhances work that requires runs with random numbers,
> whether runs are in a cluster computing environment or in a single
> workstation.
> 
> It is a solution for two separate problems.
> 
> Problem 1. I scripted up 1000 R runs and need high quality,
> unique, replicable random streams for each one. Each simulation
> runs separately, but I need to be confident their streams are
> not correlated or overlapping. For replication, I need to be able to
> select any run, say 667, and restart it exactly as it was.
> 
> Problem 2. I've written a parallel MPI (Message Passing Interface)
> routine that launches 1000 runs and I need to ensure each has
> a unique, replicable random stream. I need to be able to
> select any run, say 667, and restart it exactly as it was.
> 
> This project develops one approach to create replicable simulations.
> It blends ideas about seed management from John M. Chambers's
> Software for Data Analysis (2008) with ideas from the snowFT
> package by Hana Sevcikova and Tony R. Rossini.
> 
> 
> Here's my proposal.
> 
> 1. Run a preliminary program to generate an array of seeds
> 
> run1:   seed1.1   seed1.2   seed1.3
> run2:   seed2.1   seed2.2   seed2.3
> run3:   seed3.1   seed3.2   seed3.3
> ...      ...       ...
> run1000   seed1000.1  seed1000.2   seed1000.3
> 
> This example provides 3 separate streams of random numbers within each
> run. Because we will use the L'Ecuyer "many separate streams"
> approach, we are confident that there is no correlation or overlap
> between any of the runs.
> 
> The projSeeds has to have one row per run, but it is not a huge
> file. I created seeds for 2000 runs of a project that requires 2 seeds
> per run.  The saved size of the file is 104443kb, which is very small. By
> comparison, a 1400x1050 jpg image would usually be twice that size.
> If you save 10,000 runs-worth of seeds, the size rises to 521,993kb,
> still pretty small.
> 
> Because the seeds are saved in a file, we are sure each
> run can be replicated. We just have to teach each program
> how to use the seeds. That is step two.

Hi.
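
For reference, the seed table from step 1 of the quoted proposal
could be sketched roughly as follows. This assumes the streams come
from nextRNGStream() in R's parallel package with the
"L'Ecuyer-CMRG" generator; the master seed, the small run count and
the helper useStream() are only illustrative and are not taken from
the attached files.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")   # the generator that nextRNGStream() works with
set.seed(12345)            # arbitrary master seed, chosen for illustration

nRuns <- 4      # 1000 in the proposal; kept small here
nStreams <- 3   # streams per run

# projSeeds[[run]][[k]] holds the .Random.seed vector for stream k of that run
projSeeds <- vector("list", nRuns)
s <- .Random.seed
for (run in seq_len(nRuns)) {
  projSeeds[[run]] <- vector("list", nStreams)
  for (k in seq_len(nStreams)) {
    projSeeds[[run]][[k]] <- s
    s <- nextRNGStream(s)   # jump ahead to the start of the next stream
  }
}

# Hypothetical helper: make stream k of a given run the active generator state
useStream <- function(run, k) {
  assign(".Random.seed", projSeeds[[run]][[k]], envir = globalenv())
}

useStream(3, 2); a <- runif(5)
useStream(3, 2); b <- runif(5)
identical(a, b)   # the same stream is replayed from its saved start
```

Saving projSeeds with saveRDS() and reading it back in each run
would then give every run access to its own row of the table.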

Some of the random number generators accept a vector as a seed,
not only a single number. This can simplify generating the seeds.
There can be one seed for each of the 1000 runs, and then
the rows of the seed matrix can be

  c(seed1, 1), c(seed1, 2), ...
  c(seed2, 1), c(seed2, 2), ...
  c(seed3, 1), c(seed3, 2), ...
  ...

There could even be only one seed, and the matrix can be generated as

  c(seed, 1, 1), c(seed, 1, 2), ...
  c(seed, 2, 1), c(seed, 2, 2), ...
  c(seed, 3, 1), c(seed, 3, 2), ...

If the initialization using the vector c(seed, i, j) is done
with a good quality hash function, the runs will be independent.

What is your opinion on this?

Another advantage of seeding with a vector is that the seed can
select among significantly more than 2^32 initial states of the
generator, 2^32 being the maximum for a single integer seed.
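
A minimal sketch of the c(seed, i, j) idea in base R, assuming MD5
(computed with tools::md5sum on a temporary file) as a stand-in for
a good quality hash. The helper hashSeed() and the master seed value
are hypothetical. Because set.seed() takes a single integer, this
sketch reduces the hash to 28 bits, so it does not reach more than
2^32 initial states; that would need a generator that is
initialized from the whole vector.

```r
# Hypothetical helper: map a seed vector such as c(seed, i, j) to one integer
# by hashing its text form with MD5 (via a temporary file, base R only).
hashSeed <- function(v) {
  f <- tempfile()
  on.exit(unlink(f))
  writeLines(paste(v, collapse = ","), f)
  h <- unname(tools::md5sum(f))
  # The first 7 hex digits give an integer in [0, 2^28), valid for set.seed()
  strtoi(substr(h, 1, 7), base = 16L)
}

seed <- 20120218L                      # single master seed, chosen arbitrarily

set.seed(hashSeed(c(seed, 667L, 2L)))  # run 667, stream 2
x1 <- runif(3)

set.seed(hashSeed(c(seed, 667L, 2L)))  # restart the same run and stream
x2 <- runif(3)

identical(x1, x2)                      # the restart reproduces the numbers
```

Only the master seed needs to be stored; the pair (i, j) identifies
the run and the stream within it.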

Petr Savicky.