[Rd] Recent changes in R related to CHARSXPs

Seth Falcon sfalcon at fhcrc.org
Fri May 25 17:36:00 CEST 2007

Hello all,

I want to highlight a recent change in R-devel to the larger developeR
community.  As of r41495, R maintains a global cache of CHARSXPs such
that each unique string is stored only once in memory.  For many
common use cases, such as dimnames of matrices and keys in
environments, the result is a significant savings in memory (and time
under some circumstances).

A result of these changes is that CHARSXPs must be treated as read-only
objects and must never be modified in place by assigning to the
char* returned by CHAR().  If you maintain a package that manipulates
CHARSXPs, you should check to see if you make such in-place
modifications.  If you do, the general solution is as follows:

   If you need a temp char buffer, you can allocate one with a new
   helper macro like this:

     /* CallocCharBuf takes care of the +1 for the \0,
        so the size argument is the length of your string. */
     char *tmp = CallocCharBuf(n);

     /* manipulate tmp */
     SEXP schar = mkChar(tmp);
     Free(tmp);  /* mkChar copies the string, so release the buffer */

   You can also use R_alloc, which has the advantage that you do not
   have to free the buffer yourself in a .Call function.

The mkChar function now consults the global CHARSXP cache and will
return an already existing CHARSXP if one with a matching string
exists.  Otherwise, it will create a new one and add it to the cache
before returning it.  
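
To make the lookup-or-create behaviour concrete, here is a minimal
stand-alone sketch of a string cache in plain C.  It is only an
illustration of the idea, not R's implementation: the names intern,
hash_str, entry and CACHE_BUCKETS are hypothetical, and the fixed-size
chained hash table is an assumption chosen for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy cache: a fixed-size chained hash table of strings. */
#define CACHE_BUCKETS 64

typedef struct entry {
    struct entry *next;
    char *str;              /* owned copy; read-only once it is cached */
} entry;

static entry *cache[CACHE_BUCKETS];

/* Simple multiplicative string hash (an arbitrary choice for the sketch). */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return the cached copy of s, adding one to the cache if none exists.
   Returns NULL if memory allocation fails. */
const char *intern(const char *s)
{
    assert(s != NULL);
    unsigned long b = hash_str(s) % CACHE_BUCKETS;

    /* Cache hit: hand back the string that is already stored. */
    for (entry *e = cache[b]; e != NULL; e = e->next)
        if (strcmp(e->str, s) == 0)
            return e->str;

    /* Cache miss: copy the string and link it into its bucket. */
    entry *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    size_t n = strlen(s);
    e->str = malloc(n + 1);
    if (e->str == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->str, s, n + 1);
    e->next = cache[b];
    cache[b] = e;
    return e->str;
}
```

Two calls to intern with equal contents return the same pointer, which
is why writing through that pointer would silently change every other
user of the string, and why equality of cached strings can be tested
by comparing pointers.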

In a discussion with Herve Pages, he suggested that the return type of
CHAR(), at least for package code, be modified from (char *) to (const
char *).  I think this is an excellent suggestion because it will
allow the compiler to alert us to package C code that might be
modifying CHARSXPs in-place.  This hasn't happened yet, but I'm hoping
that a patch for this will be applied soon (unless better suggestions
for improvement arise through this discussion :-)
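
As a rough illustration of what a const-qualified accessor buys us,
here is a small stand-alone sketch in plain C.  It does not use R's
headers: cached_string, cs_chars and replace_char_copy are
hypothetical names standing in for a CHARSXP, CHAR() and some package
code that wants a modified string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cached string object; not R's CHARSXP. */
typedef struct {
    char data[32];
} cached_string;

/* Accessor with the proposed const-qualified return type. */
const char *cs_chars(const cached_string *s)
{
    assert(s != NULL);
    return s->data;
}

/* Return a heap-allocated copy of s in which every 'from' character is
   replaced by 'to'.  The caller frees the result; NULL means the
   allocation failed. */
char *replace_char_copy(const cached_string *s, char from, char to)
{
    const char *src = cs_chars(s);

    /* src[0] = to;   would not compile: src points to const char */

    size_t n = strlen(src);
    char *tmp = malloc(n + 1);      /* +1 for the terminating \0 */
    if (tmp == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        tmp[i] = (src[i] == from) ? to : src[i];
    tmp[n] = '\0';
    return tmp;
}
```

With the const return type, the commented-out assignment is rejected
at compile time, so the only way to get a modified string is to work
on a private copy and leave the shared one untouched.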

One other thing is worth mentioning: at present, not all CHARSXPs are
captured by the cache.  I think the goal is to refine things so that
all CHARSXPs _are_ in the cache.  At that point, strcmp calls can be
replaced with pointer comparisons, which should provide some nice
speedups.  So part of the idea is that the way to get CHARSXPs is via
mkChar or mkString and that one should not use allocString, etc.

Finally, here is a comparison of time and memory for loading all the
environments (hash tables) in Bioconductor's GO annotation data

## unpatched

    > gc()
             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 168891  9.1     350000 18.7   350000 18.7
    Vcells 115731  0.9     786432  6.0   425918  3.3
    > library("GO")
    > system.time(for (e in ls(2)) get(e))
       user  system elapsed
     51.919   1.168  53.228
    > gc()
               used  (Mb) gc trigger   (Mb) max used  (Mb)
    Ncells 17879072 954.9   19658017 1049.9 18683826 997.9
    Vcells 31702823 241.9   75190268  573.7 53912452 411.4

## patched

    > gc()
             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 154717  8.3     350000 18.7   350000 18.7
    Vcells 133613  1.1     786432  6.0   483138  3.7
    > library("GO")
    > system.time(for (e in ls(2)) get(e))
       user  system elapsed
     31.166   0.736  31.998
    > gc()
               used  (Mb) gc trigger  (Mb) max used  (Mb)
    Ncells  5837253 311.8    6910418 369.1  6193578 330.8
    Vcells 16831859 128.5   45712717 348.8 39456690 301.1

Best Wishes,

+ seth

Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
