[R] Parallel package guidance needed

Therneau, Terry M., Ph.D. therneau at mayo.edu
Tue Jan 17 03:19:48 CET 2017


I have a process that I need to parallelize, and have a question about two
different ways to proceed.  It is essentially an MCMC exploration where
the likelihood is a sum over subjects (6000 of them), and the per-subject
computation is the slow part.

Here is a rough schematic of the code using one approach:

mymc <- function(formula, data, subset, na.action, id, etc) {
    # lots of setup, long but computationally quick

    hlog <- function(thisid, param) {
        # compute the loglik for this subject
        ...
    }

    uid <- unique(id)  # multiple data rows for each subject

    # burn-in: each iteration forks a fresh set of workers via
    # mclapply (from the parallel package)
    for (i in 1:burnin) {
        param <- get_next_proposal()
        loglist <- mclapply(uid, hlog, param = param)
        loglik <- sum(unlist(loglist))
        # process result
    }

    # Now the non-burnin MCMC iterations
    ...
}
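
For concreteness, here is a minimal self-contained sketch of one such iteration; hlog, uid, and param below are toy stand-ins for illustration only, not the real ones from mymc():

library(parallel)

# toy stand-ins; the real hlog and uid come from the setup in mymc()
hlog  <- function(thisid, param) sum(dnorm(rnorm(100), mean = param, log = TRUE))
uid   <- seq_len(6000)
param <- 0

# with the default mc.preschedule = TRUE, mclapply splits the 6000
# subjects into mc.cores chunks up front, so one call costs roughly
# mc.cores forks rather than 6000 separate task dispatches
loglist <- mclapply(uid, hlog, param = param, mc.cores = 50)
loglik  <- sum(unlist(loglist))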

The second approach is to put cluster formation outside the loop, e.g.,

 ...
 # create the worker pool once and reuse it for every iteration
 clust <- makeForkCluster()
 for (i in 1:burnin) {
     param <- get_next_proposal()
     loglist <- parLapply(clust, uid, hlog, param = param)
     loglik <- sum(unlist(loglist))
     # process result
 }

 # rest of the code

 stopCluster(clust)
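
If the persistent cluster lives inside mymc(), an on.exit() guard keeps the workers from being orphaned when an iteration fails. A rough sketch of that pattern (the function name and nworkers = 50 are illustrative, not from the code above); note that forked workers inherit hlog and the setup objects, so no clusterExport() is needed:

library(parallel)

mymc_burnin <- function(uid, hlog, burnin, nworkers = 50) {
    clust <- makeForkCluster(nnodes = nworkers)
    on.exit(stopCluster(clust), add = TRUE)  # cleanup even on error

    for (i in seq_len(burnin)) {
        param <- get_next_proposal()   # as in the schematic above
        # forked workers already see hlog and the setup objects;
        # each call ships only the uid chunks and param
        loglist <- parLapply(clust, uid, hlog, param = param)
        loglik  <- sum(unlist(loglist))
        # process result
    }
}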

------------------

On the face of it, the second approach looks like it "could" be more efficient,
since it starts and stops the worker processes only once.  A short trial on one
of our cluster servers seems to say the opposite: on a quiet machine the load
average never gets much above 5-6 with method 2, but reaches the 20s with
method 1 (detectCores() = 80 on the box; we used mc.cores = 50).  Wall time for
method 2 is looking to be several hours.
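
For what it is worth, a one-iteration timing harness along these lines (assuming uid, hlog, and get_next_proposal from the schematics above) would make the comparison concrete before committing to a full run:

library(parallel)

param <- get_next_proposal()

# method 1: fork-per-call
t1 <- system.time(
    mclapply(uid, hlog, param = param, mc.cores = 50)
)

# method 2: persistent fork cluster
clust <- makeForkCluster(nnodes = 50)
t2 <- system.time(
    parLapply(clust, uid, hlog, param = param)
)
stopCluster(clust)

# elapsed (wall-clock) time is the number that matters here,
# since the work happens in the child processes
c(mclapply = t1[["elapsed"]], parLapply = t2[["elapsed"]])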

Any pointers to documentation/discussion at this level would be much appreciated.  I'm going to be fitting a lot of models.

Terry T.

