[Rd] Reduce memory peak when serializing to raw vectors

Simon Urbanek simon.urbanek at r-project.org
Tue Mar 17 23:13:29 CET 2015


In principle, yes (that's what Rserve serialization does), but AFAIR we don't have the infrastructure in place for that. But then you may as well serialize to a connection instead. To be honest, I don't see why you would serialize anything big to a vector: there is nothing useful you can do with it that you couldn't also do with the streaming version.
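
For illustration, the two-pass idea from Michael's message quoted below could be sketched in plain C roughly as follows. This is only a minimal sketch, not R's code: out_stream, out_bytes, toy_serialize and serialize_two_pass are hypothetical names, and the toy format (a length header followed by the doubles) merely stands in for R_Serialize(). It also assumes that both passes emit exactly the same bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output stream: while buf is NULL, bytes are only counted. */
typedef struct {
    unsigned char *buf;   /* destination, or NULL during the counting pass */
    size_t count;         /* number of bytes emitted so far */
} out_stream;

/* Emit n bytes: copy them if a destination exists, always advance the count. */
static void out_bytes(out_stream *s, const void *src, size_t n)
{
    if (s->buf != NULL)
        memcpy(s->buf + s->count, src, n);
    s->count += n;
}

/* Toy stand-in for R_Serialize(): a length header, then the raw doubles. */
static void toy_serialize(const double *x, size_t n, out_stream *s)
{
    out_bytes(s, &n, sizeof n);
    out_bytes(s, x, n * sizeof *x);
}

/* Two passes: measure first, then allocate the exact size and write into it.
 * Returns a malloc'd buffer of *len bytes (caller frees), or NULL on failure. */
static unsigned char *serialize_two_pass(const double *x, size_t n, size_t *len)
{
    out_stream s = { NULL, 0 };
    toy_serialize(x, n, &s);              /* pass 1: count only, no storage */
    size_t need = s.count;

    unsigned char *buf = malloc(need);
    if (buf == NULL)
        return NULL;

    s.buf = buf;
    s.count = 0;
    toy_serialize(x, n, &s);              /* pass 2: write into final buffer */
    assert(s.count == need);              /* both passes must agree on size */

    *len = need;
    return buf;
}
```

The object is walked twice, so this trades time for memory: no growing intermediate buffer is needed, and nothing has to be copied at the end.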


> On Mar 17, 2015, at 17:48, Michael Lawrence <lawrence.michael at gene.com> wrote:
> 
> Presumably one could stream over the data twice, the first time only to get the size, without storing the data. Slower but more memory efficient, unless I'm missing something.
> 
> Michael
> 
>> On Tue, Mar 17, 2015 at 2:03 PM, Simon Urbanek <simon.urbanek at r-project.org> wrote:
>> Jorge,
>> 
>> What you propose is not possible because the size of the output is not known in advance. That's why a dynamically growing PStream buffer is used: the output cannot be pre-allocated.
>> 
>> Cheers,
>> Simon
>> 
>> 
>> > On Mar 17, 2015, at 1:37 PM, Martinez de Salinas, Jorge <jorge.martinez-de-salinas at hp.com> wrote:
>> >
>> > Hi,
>> >
>> > I've been doing some tests using serialize() to a raw vector:
>> >
>> >       df <- data.frame(runif(50e6,1,10))
>> >       ser <- serialize(df,NULL)
>> >
>> > In this example the data frame and the serialized raw vector occupy ~400MB each, for a total of ~800MB. However, the memory peak during serialize() is ~1.2GB:
>> >
>> >       $ cat /proc/15155/status |grep Vm
>> >       ...
>> >       VmHWM:   1207792 kB
>> >       VmRSS:    817272 kB
>> >
>> > We work with very large data frames, and in many cases this peak kills R with an "out of memory" error.
>> >
>> > This is the relevant code in R 3.1.3 in src/main/serialize.c:2494
>> >
>> >       InitMemOutPStream(&out, &mbs, type, version, hook, fun);
>> >       R_Serialize(object, &out);
>> >       val =  CloseMemOutPStream(&out);
>> >
>> > The serialized object is stored in a buffer pointed to by out.data. Then, in CloseMemOutPStream(), R copies the whole buffer to a newly allocated SEXP object (the raw vector that stores the final result):
>> >
>> >       PROTECT(val = allocVector(RAWSXP, mb->count));
>> >       memcpy(RAW(val), mb->buf, mb->count);
>> >       free_mem_buffer(mb);
>> >       UNPROTECT(1);
>> >
>> > Just before free_mem_buffer() is called, the process is using ~1.2GB (the original data frame + the serialization buffer + the final serialized raw vector).
>> >
>> > One possible solution would be to allocate a buffer for the final raw vector and store the serialization result directly into that buffer. This would bring the memory peak down from ~1.2GB to ~800MB.
>> >
>> > Thanks,
>> > -Jorge
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >
>> 
> 
