[Rd] mcparallel / mccollect

Tue Aug 30 23:31:05 CEST 2016

Michel,

thanks, you're right, that the list should have names. Your patch has the match() part backwards, but is otherwise the right idea. I have committed a variant in R-devel and will back-port later.

Thanks,
Simon

> On Aug 30, 2016, at 8:43 AM, Michel Lang <michellang at gmail.com> wrote:
> 
> Hi there,
> 
> I've tried to implement an asynchronous job scheduler using
> parallel::mcparallel() and parallel::mccollect(..., wait=FALSE). My
> goal was to send processes to the background, leaving the R session
> open for interactive use while all jobs store their results on the
> file system. To keep track of the running jobs I've stored the process
> ids and written a little helper to not spawn new threads before
> already started threads have terminated if the maximum number of CPUs
> is reached.
> 
> Unfortunately, this turned out to be impossible with the current
> implementation in parallel for a number of reasons:
> 
> 1) The returned results are not named by process id or job name if
> wait is set to FALSE.
> 
> 2) The number of returned results depends on the state of computation:
> If all or none jobs are finished, just NULL is returned. Otherwise a
> list of so far collected results is returned.
> 
> 3) Combining (1) and (2) renders mapping the results to the stored
> process ids impossible. E.g., if you query mccollect for the results
> of 4 jobs and set wait=FALSE, you can get an unnamed list with one
> result or a list with four results but in a different order.
> 
> 4) An obvious workaround would wrap the expression to evaluate in a
> function which sticks a unique identifier to the return value. This
> way, one would not have to rely on process ids or job names. However,
> each job has to be collected twice:  the first time you get the result
> (which is fine for the workaround), the second time you just get NULL.
> And you have to collect them twice to free used resources -- at least
> on unix systems.
> 
> Here is a small example to illustrate the current behavior:
> 
> library(parallel)
> f = function(x) { Sys.sleep(x); sprintf("job with x = %i", x) }
> jobs = integer()
> jobs = c(jobs, mcparallel(f(10), name = "jobname1")$pid)
> jobs = c(jobs, mcparallel(f(3), name = "jobname2")$pid)
> 
> for (i in 1:13) {
>  message("\ni = ", i)
>  print(mccollect(jobs, wait = FALSE, timeout = 0))
>  Sys.sleep(1)
> }
> 
> I've created a small patch
> (<https://gist.github.com/mllg/82410d0f564a7a24251e9e747e210b39>)
> which applies the same mechanism to name the results for wait=FALSE as
> it was already implemented for wait=TRUE. I think the documentation is
> already rather describing the behavior after my patch than before my
> patch.
> A note on the need to collect results twice might prove useful for the
> future though.
> 
> Thanks,
> Michel
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>