[R] Random Seed Location

Paul Gilbert pgilbert902 at gmail.com
Sun Mar 4 19:18:32 CET 2018


On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack001 at sbcglobal.net>
wrote:

(Sorry to be a bit slow responding.)

You have not supplied a complete example, which would be good in this 
case because what you are suggesting could be a serious bug in R or a 
package. Serious journals require reproducibility these days. For 
example, JSS is very clear on this point.

To your question
 > My question simply is:  should the location of the set.seed command
 > matter, provided that it is applied before any commands which involve
 > randomness (such as partitioning)?

the answer is no, it should not matter. But the proviso is important.

You can determine where things are messing up using something like

  set.seed(654321)
  zk <- RNGkind()        # [1] "Mersenne-Twister" "Inversion"
  zk
  z <- runif(2)
  z
  set.seed(654321)

  #  install.packages(c('caret', 'ggplot2', 'e1071'))
  library(caret)
  all(runif(2)  == z)   # should be true but it is not always

  set.seed(654321)
  library(ggplot2)
  all(runif(2)  == z)   # should be true

  set.seed(654321)
  library(e1071)
  all(runif(2)  == z)   # should be true

  all(RNGkind() == zk)  # should be true

On my computer, package caret sometimes, but not always, seems to do 
something that advances or changes the state of the RNG. So you will 
need to set the seed after that package is loaded if you want 
reproducibility.
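
As a concrete illustration of the ordering I mean, here is a minimal
sketch (the package names follow the example above; the sample() call
and the value of n are placeholders, not part of your analysis):

  # Load every package first, then set the seed immediately before the
  # first call that actually consumes random numbers.
  library(caret)
  library(ggplot2)
  library(e1071)

  set.seed(654321)    # set *after* all packages are attached
  n <- 5000           # placeholder size, for illustration only
  train <- sample(1:n, size = 0.6 * n, replace = FALSE)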

As Bill Dunlap points out, parallel can introduce much more complicated 
issues. If you are in fact using parallel then we really need a new 
thread with a better subject line, and the discussion will get much messier.

The short answer is that yes, you should be able to get reproducible 
results with parallel computing. If you cannot, then you are almost 
certainly doing something wrong. To publish, you really must have 
reproducible results.

In the example that Bill gave, I think the problem is that set.seed() 
only resets the seed in the main R process; the nodes continue to 
operate with unreset RNGs. To demonstrate this to yourself you can do

  library(parallel)
  cl <- parallel::makeCluster(3)
  parallel::clusterCall(cl, function()set.seed(100))
  parallel::clusterCall(cl, function()RNGkind())
  parallel::clusterCall(cl, function()runif(2)) # similar result from all nodes
                                                # [1] 0.3077661 0.2576725

However, do *NOT* do that in real work: you will be getting the same RNG 
stream from each node. If you are using random numbers and parallel, you 
need to read a lot more, and probably consider a variant of the 
"L'Ecuyer" generator or something else designed for parallel computing.

One special point I will mention because it does not seem to be widely
appreciated: the number of nodes affects the random stream, so recording 
the number of compute nodes along with the RNG and seed information is 
important for reproducible results. This has the unfortunate consequence 
that an experiment cannot be reproduced on a larger cluster. (If anyone 
knows differently I would very much like to hear.)
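
To make that bookkeeping concrete, here is a small sketch of the kind of
record I have in mind (the list layout and the names are only a
suggestion):

  library(parallel)
  cl <- makeCluster(3)
  clusterSetRNGStream(cl, iseed = 654321)
  runInfo <- list(
    rngKind  = RNGkind(),          # generator settings in the main process
    seed     = 654321,
    nNodes   = length(cl),         # the cluster size matters too
    rVersion = R.version.string
  )
  runInfo
  stopCluster(cl)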

Paul Gilbert


 > Hi all,
 >
 > For some odd reason when running naïve Bayes, k-NN, etc., I get slightly
 > different results (e.g., error rates, classification probabilities) from
 > run to run even though I am using the same random seed.
 >
 > Nothing else (input-wise) is changing, but my results are somewhat
 > different from run to run.  The only randomness should be in the
 > partitioning, and I have set the seed before this point.
 >
 > My question simply is:  should the location of the set.seed command
 > matter, provided that it is applied before any commands which involve
 > randomness (such as partitioning)?
 >
 > If you need to see the code, it is below:
 >
 > Thank you,
 > Gary
 >
 >
 > A.      Separate the original (in-sample) data from the new
 > (out-of-sample) data.  Set a random seed.
 >
 >> InvestTech <- as.data.frame(InvestTechRevised)
 >> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
 >> InvestTech <- InvestTech[1:5000, ]
 >> set.seed(654321)
 >
 > B.      Install and load the caret, ggplot2 and e1071 packages.
 >
 >> install.packages("caret")
 >> install.packages("ggplot2")
 >> install.packages("e1071")
 >> library(caret)
 >> library(ggplot2)
 >> library(e1071)
 >
 > C.      Bin the predictor variables with approximately equal counts using
 > the cut_number function from the ggplot2 package.  We will use 20 bins.
 >
 >> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
 >> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
 >> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
 >> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
 >
 > D.      Partition the original (in-sample) data into 60% training and 40%
 > validation sets.
 >
 >> n <- nrow(InvestTech)
 >> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
 >> InvestTechTrain <- InvestTech[train, ]
 >> InvestTechVal <- InvestTech[-train, ]
 >
 > E.      Use the naiveBayes function in the e1071 package to fit the
 > model.
 >
 >> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
 >> prob <- predict(model, newdata = InvestTechVal, type = "raw")
 >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
 >
 > F.      Use the confusionMatrix function in the caret package to
 > output the confusion matrix.
 >
 >> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]),
 >>                            mode = "everything", positive = "1")
 >> accuracy <- confMtr$overall[1]
 >> valError <- 1 - accuracy
 >> confMtr
 >
 > G.      Classify the 18 new (out-of-sample) readers using the following
 > code.
 >> prob <- predict(model, newdata = outOfSample, type = "raw")
 >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
 >> cbind(pred, prob, outOfSample[, -3])
 >
 >
 >


If your computations involve the parallel package then set.seed(seed)
may not produce repeatable results.  E.g.,

 > cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local host
 > set.seed(100); runif(2)
[1] 0.3077661 0.2576725
 > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
          [,1]     [,2]     [,3]
[1,] 101.7779 102.5308 103.3459
[2,] 101.8128 102.6114 103.9102
 >
 > set.seed(100); runif(2)
[1] 0.3077661 0.2576725
 > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
          [,1]     [,2]     [,3]
[1,] 101.1628 102.9643 103.2684
[2,] 101.9205 102.6937 103.7907


Bill Dunlap
TIBCO Software
wdunlap tibco.com


