[Rd] likely bug in 'serialize' or please explain the memory usage

Sklyar, Oleg (London) osklyar at maninvestments.com
Tue Nov 3 12:27:58 CET 2009


Hi all,

assume the following problem: a function call takes a function object
and a data variable and calls this function with this data on a remote
host. It uses serialization to pass both the function and the data via a
socket connection to the remote host. The problem is that, depending on
how we call the same construct, the function may be serialized together
with the data, which was not requested, as the (runnable) example below
demonstrates. This is a problem for parallel computing: it affects Rmpi
and every other parallel implementation we tested, leading to endless
execution times in some cases where the total data passed is huge.
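
For readers who want to see the mechanism in isolation, here is a minimal
sketch of the round trip described above, with the socket left out. The
name remote_call_sketch is hypothetical and stands in for whatever a real
parallel backend does; only serialize() and unserialize() are actual R
functions.

```r
# Hypothetical stand-in for a remote call: serialize, "send", unserialize, apply.
remote_call_sketch <- function(x, fun) {
    fun_bytes <- serialize(fun, NULL)   # raw vector that would go over the socket
    dat_bytes <- serialize(x, NULL)     # same for the data

    # ... what the remote host would do after receiving the bytes ...
    remote_fun <- unserialize(fun_bytes)
    remote_dat <- unserialize(dat_bytes)
    remote_fun(remote_dat)
}

res <- remote_call_sketch(c(1, 2, 3), function(z) z * 2)
```

The cost of such a call is roughly length(fun_bytes) + length(dat_bytes),
which is why the size of the serialized function matters.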

Assume 'mycall' below is the function that takes data and a function
object, serializes them and calls the remote host. To make it runnable I
just print the size of the serialized objects. In a parallel apply
implementation it would serialize individual list elements and a function
and pass those over. Assuming 1 element is 1 MB and having 100 elements
and a function as simple as function(z) z, we would expect to pass around
100 MB of data in total, 1 MB to each individual process. However, what
happens is that in some situations all 100 MB of data are passed to all
the slaves, because the function is serialized to include all of the
data! This always happens when we make such a call from an S4 method and
the function we pass is defined inline; see the last example.
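
The size difference is consistent with how R closures work: a function
object carries a reference to the environment it was created in, and
serialize() writes that environment out with the function unless it is
the global environment, a package environment or a namespace. The sketch
below shows this in isolation; make_closure, big and f are illustrative
names, not part of the original example. Why the plain-function variant
stays small while the S4 method does not is probably down to lazy
evaluation (S4 dispatch has to evaluate x to find its class, whereas a
plain function may still hold x as an unevaluated promise when fun is
serialized), but that reading is an assumption and is not confirmed here.

```r
# Illustrative names only (make_closure, big, f); not from the original post.
make_closure <- function(x) {
    force(x)            # evaluate x so that its value is stored in this frame
    function(z) z       # closure whose enclosing environment is this frame
}

big <- runif(1e6)       # about 8 MB of doubles
f <- make_closure(big)

# The closure's environment holds x, so serialize() writes x out as well.
has_x  <- "x" %in% ls(environment(f))
is_big <- length(serialize(f, NULL)) > 8e6
```

Both has_x and is_big come out TRUE: the function body is tiny, but the
captured frame is not.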

Can anybody explain this, and possibly suggest a solution? Well, one is
-- do not define functions to call in the same environment as the caller
:(
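
Another possible workaround, sketched here under the assumption that the
function does not need any variable from its enclosing environment, is to
reset the function's environment before serializing it. strip_env and
caller are hypothetical helper names; environment<- and globalenv() are
base R.

```r
# Hypothetical helper: return a copy of fun whose environment is env.
# Only safe when fun does not rely on variables from where it was defined.
strip_env <- function(fun, env = globalenv()) {
    environment(fun) <- env
    fun
}

caller <- function(x) {
    force(x)                       # make sure x is evaluated in this frame
    nestedfun <- function(z) z     # would normally drag this frame along
    c(before = length(serialize(nestedfun, NULL)),
      after  = length(serialize(strip_env(nestedfun), NULL)))
}

sizes <- caller(runif(1e6))
```

Here sizes[["before"]] includes the 8 MB of data, while sizes[["after"]]
covers only the function itself, since the global environment is
serialized as a reference rather than by content.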

I do not have immediate access to the newest version of R, so I would be
grateful if somebody could test it there and let me know if the problem
is still present. The example is runnable.

Thanks,
Oleg

Dr Oleg Sklyar
Research Technologist
AHL / Man Investments Ltd
+44 (0)20 7144 3803
osklyar at maninvestments.com

-------------------------------------------------------------------------------

mycall = function(x, fun) {
    FUN = serialize(fun, NULL)
    DAT = serialize(x, NULL)
    
    cat(sprintf("length FUN=%d; length DAT=%d\n", length(FUN), length(DAT)))
    ## return results of a call on a remote host with FUN and DAT
    invisible(NULL)
}

## the function variant I  will be passing into mycall
innerfun = function(z) z
x = runif(1e6)

## test run from the command line
mycall(x, innerfun)
# output: length FUN=106; length DAT=8000022

## test run from within a function
outerfun1 = function(x) mycall(x, innerfun)
outerfun1(x)
# output: length FUN=106; length DAT=8000022

## test run from within a function, where function is defined within
outerfun2 = function(x) {
    nestedfun = function(z) z
    mycall(x, nestedfun)
}
outerfun2(x)
# output: length FUN=253; length DAT=8000022

setGeneric("outerfun3", function(x) standardGeneric("outerfun3"))
## define a method

## test run from within a method
setMethod("outerfun3", "numeric",
    function(x) mycall(x, innerfun))
outerfun3(x)
# output: length FUN=106; length DAT=8000022

## test run from within a method, where function is defined within
setMethod("outerfun3", "numeric",
    function(x) {
        nestedfun = function(z) z
        mycall(x, nestedfun)
    })
## THIS WILL BE WRONG!
outerfun3(x)
# output: length FUN=8001680; length DAT=8000022


--------------------------------------------------
R version 2.9.0 (2009-04-17) 
x86_64-unknown-linux-gnu 

locale:
C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

