[Rd] likely bug in 'serialize' or please explain the memory usage

Duncan Murdoch murdoch at stats.uwo.ca
Tue Nov 3 14:01:19 CET 2009


On 03/11/2009 7:29 AM, Sklyar, Oleg (London) wrote:
> Duncan,
> 
> thanks for suggestions, I will try attaching a new environment.
> 
> However this still does not explain the behaviour and does not confirm
> that it is correct. What puzzles me most is that if I define a function
> within another function then only the function gets serialized, yet when
> this is within an S4 method definition, then also the args. 


Okay, I've taken a look at your code.  I think what you're seeing is 
lazy evaluation.  S4 generics evaluate their args when they dispatch to 
a method, but normal functions don't.  So the increase from 106 bytes to 
253 bytes when the function was nested in a regular function was to hold 
the promise to evaluate x, whereas in the method, x had been evaluated 
to determine that it was numeric, and your particular method should be 
dispatched to.

So if in your nested case you add a line

force(x)

I think you'll see the size balloon up.
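
Something like the following sketch should show the difference.  It is 
not your original code: fun_size(), outer_lazy() and outer_forced() are 
names made up for illustration, and since the exact byte counts depend on 
the R version, none are quoted here.

```r
## fun_size() is a hypothetical helper: bytes needed to serialize a function
fun_size <- function(fun) length(serialize(fun, NULL))

## lazy case: x is never touched before nestedfun is serialized,
## so the enclosing frame holds only a promise for x
outer_lazy <- function(x) {
    nestedfun <- function(z) z
    fun_size(nestedfun)
}

## forced case: x is evaluated first, so its value lives in the frame
## that nestedfun carries along as its environment
outer_forced <- function(x) {
    force(x)
    nestedfun <- function(z) z
    fun_size(nestedfun)
}

x <- runif(1e6)
lazy_size   <- outer_lazy(x)
forced_size <- outer_forced(x)
cat(sprintf("lazy=%d; forced=%d\n", lazy_size, forced_size))
```

The forced variant has to carry the full vector, so its size should come 
out above the 8e6 bytes that the data itself occupies.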

Now, it might be a problem that you're serializing a promise, because I 
think you'd likely get trouble with something like this:

  outerfun2 = function(x) {
      nestedfun = function() x
      mycall(x, nestedfun)
  }

If you serialize nestedfun and it only saves the promise to evaluate x, 
then unserialize it somewhere else, the promise probably won't evaluate 
to what you expected.  But you often get problems when you create 
functions that depend on unevaluated promises, and there might be a 
valid reason to want to serialize one, so I wouldn't call it a bug.
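
If the aim is just to keep the data out of the serialized function, the 
suggestion made earlier in this thread (attach a smaller environment) can 
be sketched as below.  This assumes nestedfun needs nothing from the 
calling frame; fun_size() and outer_trimmed() are hypothetical names, not 
part of the original example.

```r
## fun_size() is a hypothetical helper: bytes needed to serialize a function
fun_size <- function(fun) length(serialize(fun, NULL))

outer_trimmed <- function(x) {
    force(x)                       # x is evaluated, as it would be in a method
    nestedfun <- function(z) z
    ## nestedfun uses nothing from this frame, so give it a fresh, empty
    ## environment whose parent is the global environment
    environment(nestedfun) <- new.env(parent = globalenv())
    fun_size(nestedfun)
}

x <- runif(1e6)
trimmed_size <- outer_trimmed(x)
cat(sprintf("trimmed=%d\n", trimmed_size))
```

Because the replacement environment is empty, the large vector is no 
longer reachable from the function and should not be written out with it.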

Duncan Murdoch

> Both have
> their own environments, so I do not see why it should be different. As
> an interim measure I just removed all the inline function definitions
> from these 'parallel' calls defining the functions as hidden outside of
> the caller, a bit ugly but works. I'd be thankful if you could look at
> the examples when you get some more time.
> 
> My main problem is less in ensuring that my code works, but in ensuring
> that when users use these parallel functionalities with their code, they
> do not get stuck in transferring data for ages simply because with every
> function one gets all the data passed.
> 
> Best,
> Oleg
> 
> Dr Oleg Sklyar
> Research Technologist
> AHL / Man Investments Ltd
> +44 (0)20 7144 3803
> osklyar at maninvestments.com 
> 
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca] 
>> Sent: 03 November 2009 11:59
>> To: Sklyar, Oleg (London)
>> Cc: r-devel at r-project.org
>> Subject: Re: [Rd] likely bug in 'serialize' or please explain the memory usage
>>
>> I haven't had a chance to look really closely at this, but I would guess
>> the problem is that in R functions are "closures".  The environment
>> attached to the function will be serialized along with it, so if you
>> have a big dataset in the same environment, you'll get that too.
>>
>> I vaguely recall that the global environment and other system
>> environments are handled specially, so that's not true for functions
>> created at the top level, but I'd have to do some experiments to confirm.
>>
>> So the solution to your problem is to pay attention to the environment
>> of the functions you create.  If they need to refer to local variables
>> in the creating frame, then you'll get all of them, so be careful about
>> what you create there.  If they don't need to refer to the local frame
>> you can just attach a new smaller environment after building the function.
>>
>> Duncan Murdoch
>>
>> Sklyar, Oleg (London) wrote:
>>> Hi all,
>>>
>>> assume the following problem: a function call takes a function object
>>> and a data variable and calls this function with this data on a remote
>>> host. It uses serialization to pass both the function and the data via a
>>> socket connection to a remote host. The problem is that depending on the
>>> way we call the same construct, the function may be serialized to
>>> include the data, which was not requested, as the example below
>>> demonstrates (runnable). This is a problem for parallel computing. The
>>> problem described below is actually a problem for Rmpi and any other
>>> parallel implementation we tested, leading to endless executions in some
>>> cases, where the total data passed is huge.
>>>
>>> Assume the below 'mycall' is the function that takes data and a function
>>> object, serializes them and calls the remote host. To make it runnable I
>>> just print the size of the serialized objects. In a parallel apply
>>> implementation it would serialize individual list elements and a function
>>> and pass those over. Assuming 1 element is 1Mb and having 100 elements
>>> and a function as simple as function(z) z we would expect to pass around
>>> 100Mb of data, 1 Mb to each individual process. However what happens is
>>> that in some situations all 100Mb of data are passed to all the slaves
>>> as the function is serialized to include all of the data! This always
>>> happens when we make such a call from an S4 method when the function we
>>> pass is defined inline, see last example.
>>>
>>> Anybody can explain this, and possibly suggest a solution? Well, one is
>>> -- do not define functions to call in the same environment as the caller
>>> :(
>>>
>>> I do not have immediate access to the newest version of R, so would be
>>> grateful if somebody could test it in that and let me know if the problem
>>> is still there. The example is runnable.
>>>
>>> Thanks,
>>> Oleg
>>>
>>> Dr Oleg Sklyar
>>> Research Technologist
>>> AHL / Man Investments Ltd
>>> +44 (0)20 7144 3803
>>> osklyar at maninvestments.com
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> mycall = function(x, fun) {
>>>     FUN = serialize(fun, NULL)
>>>     DAT = serialize(x, NULL)
>>>
>>>     cat(sprintf("length FUN=%d; length DAT=%d\n", length(FUN), length(DAT)))
>>>     ## return results of a call on a remote host with FUN and DAT
>>>     invisible(NULL)
>>> }
>>>
>>> ## the function variant I  will be passing into mycall
>>> innerfun = function(z) z
>>> x = runif(1e6)
>>>
>>> ## test run from the command line
>>> mycall(x, innerfun)
>>> # output: length FUN=106; length DAT=8000022
>>>
>>> ## test run from within a function
>>> outerfun1 = function(x) mycall(x, innerfun)
>>> outerfun1(x)
>>> # output: length FUN=106; length DAT=8000022
>>>
>>> ## test run from within a function, where function is defined within
>>> outerfun2 = function(x) {
>>>     nestedfun = function(z) z
>>>     mycall(x, nestedfun)
>>> }
>>> outerfun2(x)
>>> # output: length FUN=253; length DAT=8000022
>>>
>>> setGeneric("outerfun3", function(x) standardGeneric("outerfun3"))
>>> ## define a method
>>>
>>> ## test run from within a method
>>> setMethod("outerfun3", "numeric",
>>>     function(x) mycall(x, innerfun))
>>> outerfun3(x)
>>> # output: length FUN=106; length DAT=8000022
>>>
>>> ## test run from within a method, where function is defined within
>>> setMethod("outerfun3", "numeric",
>>>     function(x) {
>>>         nestedfun = function(z) z
>>>         mycall(x, nestedfun)
>>>     })
>>> ## THIS WILL BE WRONG!
>>> outerfun3(x)
>>> # output: length FUN=8001680; length DAT=8000022
>>>
>>>
>>> --------------------------------------------------
>>> R version 2.9.0 (2009-04-17) 
>>> x86_64-unknown-linux-gnu 
>>>
>>> locale:
>>> C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>>
>>>
>>> **********************************************************************
>>>  Please consider the environment before printing this email or its attachments.
>>> The contents of this email are for the named addressees ...{{dropped:19}}
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>   
>>