[Rd] Moderating consequences of garbage collection when in C

Martin Morgan mtmorgan at fhcrc.org
Mon Nov 14 21:08:24 CET 2011


On 11/14/2011 11:47 AM, dhinds at sonic.net wrote:
> dhinds at sonic.net wrote:
>> Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>>> Allocating many small objects triggers numerous garbage collections as R
>>> grows its memory, seriously degrading performance. The specific use case
>>> is in creating a STRSXP of several million elements of 60-100
>>> characters each; a simplified illustration understating the effects
>>> (because there is initially little to garbage collect, in contrast to an
>>> R session with several packages loaded) is below.
>
>> What a coincidence -- I was just going to post a question about why it
>> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
>> characters long.  I had noticed that this seemed to show much worse
>> than linear scaling.  I had not thought of garbage collection as the
>> culprit -- but indeed it is.  By manipulating the GC trigger, I can
>> make this operation take as little as 3 seconds (with no GC) or as
>> long as 76 seconds (with 31 garbage collections).
>
> I had done some Google searches on this issue, since it seemed like it
> should not be too uncommon, but the only other hit I could come up
> with was a thread from 2006:
>
> https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html
>
> In any case, one issue with your suggested workaround is that it
> requires knowing how much additional storage is needed, which may be
> an expensive operation to determine.  I've just tried implementing a
> different approach, which is to define two new functions to either
> disable or enable GC.  The function to disable GC first invokes
> R_gc_full() to shrink the heap as much as possible, then sets a flag.
> Then in R_gc_internal(), I first check that flag, and if it is set, I
> call AdjustHeapSize(size_needed) and exit immediately.

I think this is a better approach; mine seriously understated the 
complexity of figuring out the required size.
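
To make the disable / enable idea concrete, here is a minimal sketch as a
self-contained toy model. It is not R's memory manager: toy_heap and every
toy_* function are hypothetical stand-ins for the flag, R_gc_full(),
R_gc_internal() and AdjustHeapSize(size_needed) mentioned above, and it
assumes every allocation stays live (as when filling a large STRSXP) and
that the heap grows by a fixed step after a collection.

```c
/* Toy model only: NOT R's memory manager. Every name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t used;        /* units allocated so far (all assumed live) */
    size_t trigger;     /* allocating past this invokes the collector */
    size_t grow_step;   /* heap growth after an unproductive collection */
    int    gc_disabled; /* the flag described in the thread */
    int    gc_count;    /* collections actually performed */
} toy_heap;

void toy_init(toy_heap *h, size_t trigger, size_t grow_step)
{
    assert(grow_step > 0);
    h->used = 0;
    h->trigger = trigger;
    h->grow_step = grow_step;
    h->gc_disabled = 0;
    h->gc_count = 0;
}

/* Stand-in for a full collection. Everything is live in this model, so
   nothing is reclaimed; only the cost (one more collection) is recorded. */
void toy_gc_full(toy_heap *h)
{
    h->gc_count++;
}

/* Stand-in for the collector entry point, called when `need` does not fit. */
void toy_gc_internal(toy_heap *h, size_t need)
{
    if (h->gc_disabled) {
        /* Flag set: grow just enough and return without collecting. */
        if (h->used + need > h->trigger)
            h->trigger = h->used + need;
        return;
    }
    toy_gc_full(h);
    while (h->used + need > h->trigger)
        h->trigger += h->grow_step;
}

void toy_alloc(toy_heap *h, size_t need)
{
    if (h->used + need > h->trigger)
        toy_gc_internal(h, need);
    h->used += need;
}

/* Collect once to shrink as far as possible, then set the flag. */
void toy_gc_disable(toy_heap *h)
{
    toy_gc_full(h);
    h->gc_disabled = 1;
}

void toy_gc_enable(toy_heap *h)
{
    h->gc_disabled = 0;
}
```

In this model, starting from toy_init(&h, 100, 100) and making 1000
one-unit allocations records 9 collections; bracketing the same loop with
toy_gc_disable() / toy_gc_enable() records only the initial one.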

> These calls could be used to bracket any code section that expects to
> make lots of calls to R's memory allocator.  The downside is that
> this approach requires that all paths out of such a code section
> (including error handling) take care to unset the GC-disabled
> flag.  I think I would want to hear from someone on the R team about
> whether they think this is a good idea.
>
> A final alternative might be to provide a vectorized version of mkChar
> that would accept a char ** and use one of these methods internally,
> rather than exporting the underlying methods as part of R's API.  I
> don't know if there are other clear use cases where GC is a serious
> bottleneck, besides constructing large vectors of mostly unique
> strings.  Such a function would be less generally useful since it
> would require that the full vector of C strings be assembled at one
> time.
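
The shape of such a vectorized constructor can be sketched in plain C. This
is only an illustration under stated assumptions: toy_strvec and
toy_mk_chars are hypothetical names, storage comes from malloc() rather
than R's heap, and R's global CHARSXP cache is ignored entirely. The point
it shows is that, given the whole char ** at once, the total size can be
found in one pass and reserved in one step, so an allocator (or collector)
is consulted a fixed number of times instead of once per string.

```c
/* Toy illustration only: hypothetical names, malloc-backed, no R types. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t n;      /* number of strings */
    char **elt;    /* elt[i] points into arena */
    char  *arena;  /* one block holding every string, NUL-terminated */
} toy_strvec;

/* Build all n strings with three allocations, independent of n. */
toy_strvec *toy_mk_chars(const char *const *strs, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += strlen(strs[i]) + 1;     /* first pass: size everything */

    toy_strvec *v = malloc(sizeof *v);
    if (v == NULL)
        return NULL;
    v->n = n;
    v->elt = malloc((n ? n : 1) * sizeof *v->elt);
    v->arena = malloc(total ? total : 1);
    if (v->elt == NULL || v->arena == NULL) {
        free(v->elt);
        free(v->arena);
        free(v);
        return NULL;
    }

    char *p = v->arena;
    for (size_t i = 0; i < n; i++) {      /* second pass: copy in place */
        size_t len = strlen(strs[i]) + 1;
        memcpy(p, strs[i], len);
        v->elt[i] = p;
        p += len;
    }
    return v;
}

void toy_strvec_free(toy_strvec *v)
{
    if (v != NULL) {
        free(v->elt);
        free(v->arena);
        free(v);
    }
}
```

The limitation noted above is visible in the signature: the caller must
already hold every C string before the call.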

Another place where this comes up is during package load, especially for 
packages with many S4 instances.

   > gcinfo(TRUE)
   > library(Matrix)
   Garbage collection 2 = 1+0+1 (level 0) ...
   7.6 Mbytes of cons cells used (40%)
   1.1 Mbytes of vectors used (18%)
   ...
   Garbage collection 58 = 39+9+10 (level 2) ...
   39.4 Mbytes of cons cells used (75%)
   2.9 Mbytes of vectors used (47%)

and continuing

   > library(IRanges)
   ...
   Garbage collection 89 = 60+14+15 (level 1) ...
   63.1 Mbytes of cons cells used (80%)
   4.3 Mbytes of vectors used (53%)

Also, something like

   > system.time(as.character(1:10000000))
   ...
   Garbage collection 124 = 60+14+50 (level 2) ...
   596.1 Mbytes of cons cells used (95%)
   226.3 Mbytes of vectors used (69%)
      user  system elapsed
   61.908   0.297  62.303

might be an R-level manifestation of the same problem.
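
A rough cost model may help explain why the collections hurt so much while
a vector is being filled. The sketch below is an assumption-laden toy, not
a description of R's collector: toy_mark_work is a hypothetical function,
it assumes every allocated object stays live, that each collection visits
every live object once, and that the collections are evenly spaced over
the allocations.

```c
/* Toy cost model only; all assumptions are stated in the comments. */
#include <assert.h>

/* Total objects visited by `g` evenly spaced collections that occur while
   `n` objects, all of which stay live, are allocated one after another.
   Collection k (k = 1..g) is assumed to see n * k / (g + 1) live objects. */
unsigned long long toy_mark_work(unsigned long long n, unsigned g)
{
    unsigned long long work = 0;
    for (unsigned k = 1; k <= g; k++)
        work += n * k / (g + 1ULL);
    return work;
}
```

Under these assumptions the work is roughly n * g / 2: for the 10,000,000
elements and 31 collections reported earlier in the thread,
toy_mark_work(10000000ULL, 31) is 155,000,000 visits, about 15 passes over
the finished vector, where no collections would mean none.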

Being able to disable / enable the GC seems like a useful patch, and I 
hope this is interesting enough for the R-core team.
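
On the concern that every exit path, including errors, must unset the
flag: one way to contain that risk is to expose only a wrapper that owns
both the set and the unset. The following is a minimal sketch in portable
C, not a proposed R API. toy_gc_disabled, toy_error and
toy_with_gc_disabled are hypothetical names, the error is modelled with
setjmp / longjmp, and nesting is not handled.

```c
/* Toy sketch only: hypothetical names; errors are modelled with longjmp. */
#include <assert.h>
#include <setjmp.h>

int toy_gc_disabled = 0;        /* hypothetical "GC disabled" flag */
static jmp_buf toy_error_ctx;   /* where a toy error unwinds to */

/* Stand-in for signalling an error from deep inside the bracketed code. */
void toy_error(void)
{
    longjmp(toy_error_ctx, 1);
}

/* Run body(data) with the flag set, and clear the flag on BOTH the normal
   and the error exit. Returns 0 on success, 1 if body() raised an error. */
int toy_with_gc_disabled(void (*body)(void *), void *data)
{
    int failed;

    toy_gc_disabled = 1;
    if (setjmp(toy_error_ctx) == 0) {
        body(data);
        failed = 0;     /* reached only when body() returned normally */
    } else {
        failed = 1;     /* reached only via toy_error() */
    }
    toy_gc_disabled = 0;
    return failed;
}

/* Example bodies: each records the flag it observed; the second fails. */
void toy_body_ok(void *data)
{
    *(int *)data = toy_gc_disabled;
}

void toy_body_fails(void *data)
{
    *(int *)data = toy_gc_disabled;
    toy_error();
}
```

Both example bodies see the flag set while they run, and in both cases the
flag is clear again once toy_with_gc_disabled() returns.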

A more fundamental issue seems to be the cost of each garbage collection 
when there are a lot of SEXPs in play:

   > system.time(gc())
      user  system elapsed
     0.236   0.000   0.236

There's a hierarchy of CHARSXP / STRSXP, so maybe that could be 
exploited in the mark phase?

Martin


> -- Dave


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


