[Rd] Example of "task seeds" with R parallel. Critique?

Fri Jan 13 23:46:01 CET 2012

Greetings:

In R parallel's vignette, there is a comment "It would however take
only slightly more work to allocate a stream to each task." (p.6).
I've written down a working example that can allocate not just one,
but several separate seeds for each task.  (We have just a few project
here that need multiple streams).  I would like to help work that up
for inclusion in the parallel package, if possible.

This approach is not original. It combines the idea behind snowFT and
ideas for setting and monitoring seeds to replicate random streams in
John Chambers Software for Data Analysis, Section 6.10. I'm able to
save a block of seeds in a file, run a simulation for each one, and
then re-run any particular task and match the random numbers.

But I realize there are a lot of dangers I am not yet aware of, so I'm
asking you what can go wrong?  I see danger in conflicts between my
effort to manage seeds and the work of parallel functions that try to
manage seeds for me.  That's why I wish I could integrate task-based
seeds into parallel itself.

RNGkind("L'Ecuyer-CMRG")
set.seed(23456)

library(parallel)

## nrep = number of repetitions (or tasks)
## streamsPerRep = number of streams needed by each repetition
nReps <- 2000
streamsPerRep <- 2

## projSeeds=list of lists of stream seeds
projSeeds <- vector(mode="list", nReps)
for (i in 1:nReps) projSeeds[[i]] <- vector(mode="list", streamsPerRep)

runif(1) ##establishes .Random.seed
##Grab first seed
s <- .Random.seed
origSeed <- s

x <- rnorm(4) ##will compare later
x

for (i in 1:nReps) {
  for (j in 1:streamsPerRep){
    projSeeds[[i]][[j]] <- s
    s <- nextRNGStream(s)
  }
}

save(projSeeds, file="projSeeds.rda")

rm(projSeeds)

load("projSeeds.rda")

##Note that origSeed does match projSeeds
origSeed
projSeeds[[1]][[1]]

## And we get same random draws from project 1, stream 1
.Random.seed <- projSeeds[[1]][[1]]
rnorm(4)
x

##Another way (preferred?) to reset stream
assign(".Random.seed", projSeeds[[1]][[1]], envir = .GlobalEnv)
rnorm(4)

## Now, how to make this more systematic
## Each task needs streamsPerRep seeds
## startSeeds = for each stream, a starting seed
## currentSeeds = for each stream, a seed recording stream's current position
## currentStream = integer indicator of which stream is in use

## Test that interactively
currentStream <- 1
currentSeeds <- startSeeds <- projSeeds[[1]]
.Random.seed <- startSeeds[[currentStream]]

useStream <- function(n = NULL, origin = FALSE){
  if (n > length(currentSeeds)) stop("requested stream does not exist")
  currentSeeds[[currentStream]] <- .Random.seed
  if (origin) assign(".Random.seed", startSeeds[[n]], envir = .GlobalEnv)
  else assign(".Random.seed",  currentSeeds[[n]], envir = .GlobalEnv)
  currentStream <<- n
}

useStream(n=1, origin=TRUE)
rnorm(4)

currentStream

useStream(n=2, origin=TRUE)
rnorm(4)

currentStream

## Now, make that work in a clustered environment

cl <- makeCluster(9, "MPI")

## run on worker, so can retrieve seeds for particular run
initSeeds <- function(p = NULL){
  currentStream <<- 1
  projSeeds[[p]]
}

clusterEvalQ(cl, {
  RNGkind("L'Ecuyer-CMRG")
})

clusterExport(cl, c("projSeeds", "useStream", "initSeeds"))

someHorrendousFunction <- function(run, parm){
  currentStream <- 1
  currentSeeds <- startSeeds <- initSeeds(run)
  assign(".Random.seed",  startSeeds[[currentStream]],  envir = .GlobalEnv)

  ##then some gigantic, long lasting computation occurs
  dat <- data.frame(x1 = rnorm(parm$N), x2 = rnorm(parm$N), y = rnorm(parm$N))
  m1 <- lm(y ~ x1 + x2, data=dat)
  list(m1, summary(m1), model.matrix(m1))
}

whatever <- list("N" = 999)

res <- parLapply(cl, 1:nReps, someHorrendousFunction, parm = whatever)

res[[77]]

##Prove I can repeat 77'th task

res77 <- someHorrendousFunction(77, parm = whatever)

## well, that worked.
stopCluster(cl)

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas