[Rd] Moderating consequences of garbage collection when in C

Thomas Lumley tlumley at uw.edu
Mon Nov 14 21:00:10 CET 2011


On Tue, Nov 15, 2011 at 8:47 AM,  <dhinds at sonic.net> wrote:
> dhinds at sonic.net wrote:
>> Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> > Allocating many small objects triggers numerous garbage collections as R
>> > grows its memory, seriously degrading performance. The specific use case
>> > is in creating a STRSXP of several 1,000,000's of elements of 60-100
>> > characters each; a simplified illustration understating the effects
>> > (because there is initially little to garbage collect, in contrast to an
>> > R session with several packages loaded) is below.
>
>> What a coincidence -- I was just going to post a question about why it
>> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
>> characters long.  I had noticed that this seemed to show much worse
>> than linear scaling.  I had not thought of garbage collection as the
>> culprit -- but indeed it is.  By manipulating the GC trigger, I can
>> make this operation take as little as 3 seconds (with no GC) or as
>> long as 76 seconds (with 31 garbage collections).
>
> I had done some google searches on this issue, since it seemed like it
> should not be too uncommon, but the only other hit I could come up
> with was a thread from 2006:
>
> https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html
>
> In any case, one issue with your suggested workaround is that it
> requires knowing how much additional storage is needed, which may be
> an expensive operation to determine.  I've just tried implementing a
> different approach, which is to define two new functions to either
> disable or enable GC.  The function to disable GC first invokes
> R_gc_full() to shrink the heap as much as possible, then sets a flag.
> Then in R_gc_internal(), I first check that flag, and if it is set, I
> call AdjustHeapSize(size_needed) and exit immediately.
>
> These calls could be used to bracket any code section that expects to
> make lots of calls to R's memory allocator.  The downside is that
> every path out of such a code section (including error handling)
> must take care to unset the GC-disabled flag.  I think I would want
> to hear from someone on the R team about whether they think this is
> a good idea.

If .Call and .C re-enabled the GC on return from compiled code (and
threw some sort of error if it had been left disabled), that would
help contain the potential damage.
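
A minimal sketch of that check, as a self-contained simulation rather
than actual R internals: gc_disabled, gc_disable(), gc_enable() and
call_native() are all hypothetical names standing in for the proposed
flag and for the .Call/.C dispatch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the proposed GC-disabled flag; not an R internal. */
static int gc_disabled = 0;

static void gc_disable(void) { gc_disabled = 1; }
static void gc_enable(void)  { gc_disabled = 0; }

typedef int (*native_fn)(void);

/* Hypothetical stand-in for the .Call/.C dispatch: run the native
 * routine, then make sure the GC is enabled again whatever the routine
 * did.  Returns 0 on success, -1 if the routine left the GC disabled. */
static int call_native(native_fn fn)
{
    assert(fn != NULL);
    (void) fn();
    if (gc_disabled) {
        gc_enable();   /* contain the damage first */
        fprintf(stderr, "error: native routine returned with GC disabled\n");
        return -1;     /* then report it to the caller */
    }
    return 0;
}

/* Brackets its allocations correctly. */
static int well_behaved(void) { gc_disable(); gc_enable(); return 0; }

/* Forgets to re-enable the GC on the way out. */
static int forgetful(void) { gc_disable(); return 0; }
```

Here call_native(well_behaved) returns 0, while call_native(forgetful)
returns -1 and leaves the flag cleared.  The sketch only covers a
normal return; a routine that exits through R's error mechanism (a
non-local jump) would need the flag reset on that path as well.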

You might also want to re-enable GC if malloc() returned NULL,
rather than giving an out-of-memory error.
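
Roughly, and again only as a simulation under assumptions:
gc_disabled, run_gc() and alloc_with_gc_fallback() are hypothetical
names, and the allocator is passed in as a function pointer so the
failure path can be exercised without exhausting memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are R internals. */
static int gc_disabled = 0;   /* the proposed GC-disabled flag */
static int collections = 0;   /* counts simulated collections */

static void run_gc(void) { collections++; }

typedef void *(*alloc_fn)(size_t);

/* Try to allocate.  If that fails while the GC is disabled, re-enable
 * the GC, collect once, and retry before giving up.  Returns NULL only
 * if the retry also fails (or the GC was already enabled). */
static void *alloc_with_gc_fallback(size_t n, alloc_fn alloc)
{
    assert(alloc != NULL);
    void *p = alloc(n);
    if (p == NULL && gc_disabled) {
        gc_disabled = 0;
        run_gc();
        p = alloc(n);
    }
    return p;
}

/* Test allocator: fails once when fail_next is set, then uses malloc(). */
static int fail_next = 0;

static void *flaky_alloc(size_t n)
{
    if (fail_next) {
        fail_next = 0;
        return NULL;
    }
    return malloc(n);
}
```

With gc_disabled and fail_next both set, a call to
alloc_with_gc_fallback(16, flaky_alloc) fails once, triggers one
simulated collection, and succeeds on the retry with the flag cleared.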

  -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


