[Rd] portable parallel seeds project: request for critiques

Fri Feb 17 21:57:26 CET 2012

I've got another edition of my simulation replication framework.  I'm
attaching 2 R files and pasting in the readme.

I would especially like to know if I'm doing anything that breaks
.Random.seed or other things that R's parallel uses in the
environment.

In case you don't want to wrestle with attachments, the same files are
online in our SVN

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/

## Paul E. Johnson CRMDA <pauljohn at ku.edu>
## Portable Parallel Seeds Project.
## 2012-02-18

Portable Parallel Seeds Project

This is how I'm going to recommend we work with random number seeds in
simulations. It enhances work that requires runs with random numbers,
whether runs are in a cluster computing environment or in a single
workstation.

It is a solution for two separate problems.

Problem 1. I scripted up 1000 R runs and need high quality,
unique, replicable random streams for each one. Each simulation
runs separately, but I need to be confident their streams are
not correlated or overlapping. For replication, I need to be able to
select any run, say 667, and restart it exactly as it was.

Problem 2. I've written a Parallel MPI (Message Passing Interface)
routine that launches 1000 runs and I need to assure each has
a unique, replicatable, random stream. I need to be able to
select any run, say 667, and restart it exactly as it was.

This project develops one approach to create replicable simulations.
It blends ideas about seed management from John M. Chambers
Software for Data Analysis (2008) with ideas from the snowFT
package by Hana Sevcikova and Tony R. Rossini.

Here's my proposal.

1. Run a preliminary program to generate an array of seeds

run1:   seed1.1   seed1.2   seed1.3
run2:   seed2.1   seed2.2   seed2.3
run3:   seed3.1   seed3.2   seed3.3
...      ...       ...
run1000   seed1000.1  seed1000.2   seed1000.3

This example provides 3 separate streams of random numbers within each
run. Because we will use the L'Ecuyer "many separate streams"
approach, we are confident that there is no correlation or overlap
between any of the runs.

The projSeeds has to have one row per project, but it is not a huge
file. I created seeds for 2000 runs of a project that requires 2 seeds
per run.  The saved size of the file 104443kb, which is very small. By
comparison, a 1400x1050 jpg image would usually be twice that size.
If you save 10,000 runs-worth of seeds, the size rises to 521,993kb,
still pretty small.

Because the seeds are saved in a file, we are sure each
run can be replicated. We just have to teach each program
how to use the seeds. That is step two.

2. Inside each run, an initialization function runs that loads the
seeds file and takes the row of seeds that it needs.  As the
simulation progresses, the user can ask for random numbers from the
separate streams. When we need random draws from a particular stream,
we set the variable "currentStream" with the function useStream().

The function initSeedStreams creates several objects in
the global environment. It sets the integer currentStream,
as well as two list objects, startSeeds and currentSeeds.
At the outset of the run, startSeeds and currentSeeds
are the same thing. When we change the currentStream
to a different stream, the currentSeeds vector is
updated to remember where that stream was when we stopped
drawing numbers from it.

Now, for the proof of concept. A working example.

Step 1. Create the Seeds. Review the R program

seedCreator.R

That creates the file "projSeeds.rda".

Step 2. Use one row of seeds per run.

Please review "controlledSeeds.R" to see an example usage
that I've tested on a cluster.

"controlledSeeds.R" can also be run on a single workstation for
testing purposes.  There is a variable "runningInMPI" which determines
whether the code is supposed to run on the RMPI cluster or just in a
single workstation.

The code for each run of the model begins by loading the
required libraries and loading the seed file, if it exists, or
generating a new "projSeed" object if it is not found.

library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)
if (file.exists("projSeeds.rda")) {
  load("projSeeds.rda")
} else {
  source("seedCreator.R")
}

## Suppose the "run" number is:
run <- 232
initSeedStreams(run)	

After that, R's random generator functions will draw values
from the first random random stream that was initialized
in projSeeds. When each repetition (run) occurs,
R looks up the right seed for that run, and uses it.

If the user wants to begin drawing observations from the
second random stream, this command is used:

useStream(2)

If the user has drawn values from stream 1 already, but
wishes to begin again at the initial point in that stream,
use this command

useStream(1, origin = TRUE)

Question: Why is this approach better for parallel runs?

Answer: After a batch of simulations, we can re-start any
one of them and repeat it exactly. This builds on the idea
of the snowFT package, by Hana Sevcikova and A.J. Rossini.

That is different from the default approach of most R parallel
designs, including R's own parallel, RMPI and snow.

The ordinary way of controlling seeds in R parallel would initialize
the 50 nodes, and we would lose control over seeds because runs would
be repeatedly assigned to nodes. The aim here is to make sure that
each particular run has a known starting point. After a batch of
10,000 runs, we can look and say "something funny happened on run
1,323" and then we can bring that back to life later, easily.

Question: Why is this better than the simple old approach of
setting the seeds within each run with a formula like

set.seed(2345 + 10 * run)

Answer: That does allow replication, but it does not assure
that each run uses non-overlapping random number streams. It
offers absolutely no assurance whatsoever that the runs are
actually non-redundant.

Nevertheless, it is a method that is widely used and recommended
by some visible HOWTO guides.

Citations

Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant
 Simple Network of Workstations. R package version 1.2-0.
 http://CRAN.R-project.org/package=snowFT

John M Chambers (2008). SoDA: Functions and Exampels for "Software
  for Data Analysis". R package version 1.0-3.

John M Chambers (2008) Software for Data Analysis. Springer.

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas