[R] bug (?) with lapply / clusterMap / clusterApply etc

jacob at forestlidar.org jacob at forestlidar.org
Wed Mar 23 17:54:59 CET 2016


Very informative! Thank you.

Quoting Martin Morgan <martin.morgan at roswellpark.org>:

> On 03/22/2016 01:46 PM, jacob at forestlidar.org wrote:
>>
>> Hello, I have encountered a bug(?) with the parallel package. When run
>> from within a function, the parLapply function appears to copy the
>> entire parent environment (the environment of the function's interior) to
>> every child node in the cluster, one node at a time, which is very slow.
>> The copied contents are not even accessible within the child nodes,
>> even though they are apparent in the memory footprint. I may be misusing
>> the terms "parent" and "node" here...
>>
>> The below code demonstrates the issue. The same parallel command is used
>> twice within the function, once before creating a large object, and once
>> afterwards. Both commands should take a nearly identical amount of time.
>> The first call takes less than 1/100th of a second, but the second
>> call takes hundreds of times longer...
>>
>> Example Code:
>>
>>      library(parallel)
>>
>>      # create a cluster of nodes
>>      if (!"clus1" %in% ls()) clus1 <- makeCluster(10)
>>
>>      # function used to demonstrate bug
>>      rows_fn1 <- function(x, clus) {
>>
>>          # first set of parallel code
>>          print(system.time(parLapply(clus, 1:5, function(z) {
>>              y <- rnorm(5000)
>>              return(mean(y))
>>          })))
>>
>>          # create large vector
>>          x <- rnorm(10^7)
>>
>>          # second set
>>          print(system.time(parLapply(clus, 1:5, function(z) {
>>              y <- rnorm(5000)
>>              return(mean(y))
>>          })))
>>      }
>>
>>      # demonstrate bug - watch task manager and see Windows slowly copy
>>      # the vector to each node in the cluster
>>      rows_fn1(1:5000, clus1)
>>
>> Although the child nodes bloat proportionally to the size of x in the
>> parent environment, x is not available in the child nodes. The code
>
> With this
>
>     library(parallel)
>     cl <- makeCluster(2)
>     f <- function() {
>         x <- 10
>         parSapply(cl, 1:5, function(i) x * i)
>     }
>
> we see both that x is available on the workers and why the variable
> is copied: symbols available in the environment in which FUN is
> defined must also be available on the workers, just as in serial
> evaluation
>
>     > f()
>     [1] 10 20 30 40 50
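
One way to see the copying discussed above without starting a cluster is to compare the serialized size of a function created inside another function with that of the same function defined at top level. This is only a rough sketch: serialize() is used here as a stand-in for what is sent to each worker, and the names make_closure and top_level_fun are hypothetical.

```r
# A worker function created inside another function: its enclosing
# environment is the frame of make_closure(), which also holds x.
make_closure <- function() {
    x <- rnorm(1e6)                  # roughly 8 MB of doubles
    function(z) mean(rnorm(5000))    # never refers to x
}

# The same worker function defined at top level: its enclosing
# environment is the global environment, which is stored by reference.
top_level_fun <- function(z) mean(rnorm(5000))

# Number of bytes needed to serialize each function
closure_size   <- length(serialize(make_closure(), NULL))
top_level_size <- length(serialize(top_level_fun, NULL))

print(c(closure = closure_size, top_level = top_level_size))
```

The closure carries x along even though its body never uses it, which is consistent with the slowdown in the second parLapply call of the original example.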
>
> Defining the function in the global environment, rather than in the  
> body of a function, avoids copying implicit state,
>
>     cl <- makeCluster(2)
>     FUN <- function(i) x * i
>     f <- function() {
>         x <- 10
>         parSapply(cl, 1:5, FUN)
>     }
>
> but requires that all arguments are defined / passed
>
>     > f()
>     Error in checkForRemoteErrors(val) (from #3) :
>       2 nodes produced errors; first error: object 'x' not found
>
> Updating the function definition and its use to pass x explicitly
>
>     FUN <- function(i, x) x * i
>     f <- function() {
>         x <- 10
>         parSapply(cl, 1:5, FUN, x)
>     }
>
>     > f()
>     [1] 10 20 30 40 50
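
Applied to the original rows_fn1 example, the same idea might look like the sketch below. The names worker_mean and rows_fn2 are hypothetical, and the two-worker cluster and smaller vector are assumptions chosen to keep the example quick. Because worker_mean is defined at top level, the large local vector should stay in the master process instead of being shipped to every worker.

```r
library(parallel)

# Worker function defined at top level (hypothetical name), so no local
# variables of the calling function travel with it.
worker_mean <- function(z) {
    y <- rnorm(5000)
    mean(y)
}

# Hypothetical variant of rows_fn1 that reuses the top-level worker.
rows_fn2 <- function(clus) {
    x <- rnorm(10^6)   # large local object, not needed by worker_mean
    parLapply(clus, 1:5, worker_mean)
}

clus <- makeCluster(2)   # assumption: two local workers are enough here
res <- rows_fn2(clus)
stopCluster(clus)
```

If the workers did need x, it would be passed as an explicit argument, as in the FUN(i, x) example above.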
>
> The foreach package tries to be smart and export only symbols used  
> (but can be tricked)
>
>     library(foreach)
>     library(doSNOW)
>     registerDoSNOW(cl)
>     g <- function() {
>         x <- 10
>         foreach(i=1:2) %dopar% { get("x") }
>     }
>
>     > g()  # fails because 'x' is not referenced directly so not exported
>     Error in { (from #3) : task 1 failed - "object 'x' not found"
>
> versus
>
>     g <- function() {
>         x <- 10
>         foreach(i=1:2) %dopar% { get("x"); x }
>     }
>
> and
>
>     > g()  # works because 'x' referenced and exported
>     [[1]]
>     [1] 10
>
>     [[2]]
>     [1] 10
>
>
> Martin
>
>> above can be tweaked to add more variables (x1,x2,x3 ...) and the child
>> nodes will bloat to the same degree.
>>
>> I am working on Windows Server 2012, I am using 64bit R version 3.2.1. I
>> upgraded to 3.2.4revised and observed the same bug.
>>
>> I have googled for this issue and have not encountered any other
>> individuals having a similar problem.
>>
>> I have attempted to reboot my machine without effect (aside from the
>> obvious).
>>
>> Any suggestions would be greatly appreciated!
>>
>> With regards,
>>
>> Jacob L Strunk
>> Forest Biometrician (PhD), Statistician (MSc)
>> and Data Munger
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>


