[BioC] snow library, question on clusterExport

Martin Morgan mtmorgan at fhcrc.org
Sat Apr 10 01:22:53 CEST 2010


Hi Mattia --

The R-sig-hpc mailing list

  https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

is probably a better place for this question, but...

On 04/09/2010 03:47 PM, mattia pelizzola wrote:
> Hi,
> 
> I have a simple function:
> 
>> library(snow)
>> fun2=function() {
> + cl=makeCluster(3)
> + Mat=matrix(2:10,3,3)
> + fun3=function(startInd, endInd=3, data=Mat) {Mat[startInd:endInd,]}
> + print(clusterApplyLB(cl, 1:3, fun3))
> + stopCluster(cl)
> + }

A function includes, as part of its definition, the environment it is
defined in. So

f1 <- function() {
    f2 <- function() {}
    x <- 1
    browser()
}

> f1()
Called from: f1()
Browse[1]> environment(f2)
<environment: 0xb540b0>
Browse[1]> ls(environment(f2))
[1] "f2" "x"
Browse[1]> environment(f2)[["x"]]
[1] 1

In something like clusterApplyLB, snow sends 'fun3' to the worker. This
includes 'fun3's environment, and that in turn includes the variable 'Mat'.
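
The same thing can be seen without a cluster. This is only a sketch: it assumes that the function travels to the worker through R's standard serialize() / unserialize(), and 'make_fun' and 'fun3_copy' are names made up for the illustration. A round trip through serialization keeps the enclosing environment, so 'Mat' arrives together with the function.

```r
# Build a closure whose defining environment holds 'Mat', as in fun2().
make_fun <- function() {
    Mat <- matrix(2:10, 3, 3)
    function(startInd, endInd = 3) Mat[startInd:endInd, ]
}
fun3 <- make_fun()

# Round trip through serialization, roughly what shipping the
# function to another R process involves.
fun3_copy <- unserialize(serialize(fun3, NULL))

ls(environment(fun3_copy))   # the copy's environment still holds "Mat"
fun3_copy(3)                 # third row of Mat: 4 7 10
```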

Note that this could be a big surprise, e.g.,

f1 = function() {
   f2 = function(i) i^2
   m = matrix(numeric(1e7), 1e3)
   clusterApplyLB(cl, 1:10, f2)
}

sends the matrix 'm' to each node in the cluster (because it is defined
in the environment of f2), even though it is irrelevant to the
calculation performed by f2. To illustrate


f1 <- function(cl, x, do) {
    f2 <- function(i) ls(environment())
    y <- x
    if (do) clusterApply(cl, 1:2, f2)
}

this sends a short vector

> x <- integer(1); system.time(f1(cl, x, TRUE))
   user  system elapsed
  0.000   0.000   0.001

and this sends a long vector, so it takes more time

> x <- integer(1e6); system.time(f1(cl, x, TRUE))
   user  system elapsed
  0.096   0.040   0.329


and this demonstrates that the cost is not in creating the vector per se, but in transporting it to the workers

> x <- integer(1e6); system.time(f1(cl, x, FALSE))
   user  system elapsed
      0       0       0
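
The size of what gets shipped can also be estimated without a cluster. The sketch below again assumes transport by R's serialize(); 'payload_size' and 'payload_trimmed' are hypothetical helper names. It measures the serialized size of f2 with a small and a large neighbour in its environment, and then shows one possible way to avoid the extra baggage: re-pointing the function's environment at the global environment. That is only safe when the function does not need anything from its defining frame.

```r
# Serialized size of f2 when a vector of length n lives in its environment.
payload_size <- function(n) {
    f2 <- function(i) i^2
    m <- numeric(n)               # irrelevant to f2, but in its environment
    length(serialize(f2, NULL))
}

# Same, but with f2 detached from the frame that holds 'm'.
payload_trimmed <- function(n) {
    f2 <- function(i) i^2
    environment(f2) <- globalenv()  # f2 uses no local variables, so this is safe
    m <- numeric(n)
    length(serialize(f2, NULL))
}

small   <- payload_size(1)
large   <- payload_size(1e6)      # grows by roughly 8 bytes per element of m
trimmed <- payload_trimmed(1e6)   # stays small regardless of n
c(small = small, large = large, trimmed = trimmed)
```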


> that is working fine:
> 
>> fun2()
> [[1]]
>      [,1] [,2] [,3]
> [1,]    2    5    8
> [2,]    3    6    9
> [3,]    4    7   10
> 
> [[2]]
>      [,1] [,2] [,3]
> [1,]    3    6    9
> [2,]    4    7   10
> 
> [[3]]
> [1]  4  7 10
> 
> now, if I run the same commands outside the function:
> 
>> cl=makeCluster(3)
>> Mat=matrix(2:10,3,3)
>> fun3=function(startInd, endInd=3, data=Mat) {Mat[startInd:endInd,]}
>> print(clusterApplyLB(cl, 1:3, fun3))
> Error in checkForRemoteErrors(val) :
>   3 nodes produced errors; first error: object 'Mat' not found

Here snow has a special rule: 'do not export the global environment'.
The environment of fun3 is .GlobalEnv (identical(environment(fun3),
.GlobalEnv) is TRUE), so 'Mat' is not exported and is not available to
the worker.
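
A small sketch of that rule, under the assumption that it comes from how R's serialize() treats the global environment (as a reference, not as contents). The name 'big_vec_demo' is made up for the example. The serialized size of a global function does not change when a large object is added to the global environment.

```r
f <- function(i) i^2
environment(f) <- globalenv()   # explicit, in case this is not run at top level

size_before <- length(serialize(f, NULL))
assign("big_vec_demo", numeric(1e6), envir = globalenv())
size_after <- length(serialize(f, NULL))
rm("big_vec_demo", envir = globalenv())

size_before == size_after       # the global contents were not serialized

# After a round trip the function points at whatever the global
# environment is in the receiving process.
g <- unserialize(serialize(f, NULL))
identical(environment(g), globalenv())
```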

> 
> so I figured out I have to export 'Mat' on the cluster nodes:
> 
>> clusterExport(cl, 'Mat')
>> print(clusterApplyLB(cl, 1:3, fun3))
> [[1]]
>      [,1] [,2] [,3]
> [1,]    2    5    8
> [2,]    3    6    9
> [3,]    4    7   10
> 
> [[2]]
>      [,1] [,2] [,3]
> [1,]    3    6    9
> [2,]    4    7   10
> 
> [[3]]
> [1]  4  7 10
> 
> I still do not understand why clusterExport is NOT necessary within
> the function 'fun2' and actually it would give an error:
> 
>> rm(Mat)
>> fun2=function() {
> + cl=makeCluster(3)
> + Mat=matrix(2:10,3,3)
> + clusterExport(cl, 'Mat')
> + fun3=function(startInd, endInd=3, data=Mat) {Mat[startInd:endInd,]}
> + print(clusterApplyLB(cl, 1:3, fun3))
> + stopCluster(cl)
> + }
>> fun2()
> Error in get(name, env = .GlobalEnv) : object 'Mat' not found

from ?clusterExport,

‘clusterExport’ assigns the global values on the master of the
     variables named in ‘list’ to variables of the same names in the
     global environments of each node.

so snow is just doing what it is documented to do.

> 
> I found clusterExport to be the solution for a more complex example,
> but I can't make it work within a function.
> What is happening here with clusterExport? And how can I export an
> object that is not on my globalEnv but rather is created within a
> function?
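
Two possible ways around this, as a hedged sketch and not the only options. The example uses the 'parallel' package that ships with current R and provides the same function names as snow; this is an assumption relative to the snow 0.3-3 session shown here, where clusterExport only looks in the global environment. Option 1 passes the local object as an extra argument, which clusterApplyLB forwards to the function. Option 2 uses the 'envir' argument that parallel's clusterExport accepts (check ?clusterExport in your version before relying on it). 'fun4', 'res1' and 'res2' are illustrative names.

```r
library(parallel)   # assumption: stands in for snow, same function names

fun2 <- function() {
    cl <- makeCluster(2)
    on.exit(stopCluster(cl))        # stop the workers even if an error occurs
    Mat <- matrix(2:10, 3, 3)

    # Option 1: hand the local object over as an explicit argument.
    fun3 <- function(startInd, data, endInd = 3) data[startInd:endInd, ]
    environment(fun3) <- globalenv()   # do not ship fun2's frame as well
    res1 <- clusterApplyLB(cl, 1:3, fun3, data = Mat)

    # Option 2: export from this function's frame instead of .GlobalEnv.
    clusterExport(cl, "Mat", envir = environment())
    fun4 <- function(startInd, endInd = 3) Mat[startInd:endInd, ]
    environment(fun4) <- globalenv()   # 'Mat' is found in the worker's global env
    res2 <- clusterApplyLB(cl, 1:3, fun4)

    list(res1 = res1, res2 = res2)
}

res <- fun2()
identical(res$res1, res$res2)
```

Option 1 keeps the data flow visible in the call; option 2 is closer to what you tried, and leaves 'Mat' on the workers for later calls.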

Hope that provides enough information to work through your problem.

Martin

> 
> many thanks!
> 
> mattia
> 
>> sessionInfo()
> R version 2.10.1 (2009-12-14)
> x86_64-unknown-linux-gnu
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] snow_0.3-3
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


