[Rd] Some R questions

miguel manese jjonphl at gmail.com
Wed Nov 1 03:30:55 CET 2006


I ran into some of these issues while working on SQLiteDF...

On 11/1/06, Vladimir Dergachev <vdergachev at rcgardis.com> wrote:
> Hi all,
>    I am working with some large data sets (1-4 GB) and have some questions
> that I hope someone can help me with:
>    1.  Is there a way to turn off garbage collector from within C interface ?
>         what I am trying to do is suck data from mysql (using my own C
>         functions) and I see that allocating each column (with about 1-4 million
>         items) takes between 0.5 and 1 seconds. My first thought was that it
>         would be nice to turn off garbage collector, allocate all the data,
>         copy values and then turn the garbage collector back on.
I believe not. FWIW a numeric() vector is a chunk of memory with a
VECTOR_SEXPREC header followed by your data, allocated contiguously.
If you are desperate enough, and assuming the garbage collector is
indeed the culprit, you may want to implement your own lightweight
allocVector (the function that NEW_NUMERIC(), etc. expand to).
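
To make the "header followed by contiguous data" idea concrete, here is a
minimal standalone sketch in plain C. It is only an illustration: toy_numeric,
TOY_REAL and toy_alloc_numeric are made-up names, the fields are not R's real
header layout, and this is not a drop-in replacement for allocVector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy header; this is NOT R's real vector header layout. */
typedef struct {
    int    type;    /* toy type tag (made up for this sketch) */
    size_t length;  /* number of elements in the payload */
    double data[];  /* payload starts directly after the header fields */
} toy_numeric;

enum { TOY_REAL = 1 };  /* made-up tag value, unrelated to R's SEXPTYPEs */

/* Allocate header and payload with a single malloc call.
 * Returns NULL on size overflow or allocation failure. */
toy_numeric *toy_alloc_numeric(size_t n)
{
    if (n > (SIZE_MAX - sizeof(toy_numeric)) / sizeof(double))
        return NULL;

    toy_numeric *v = malloc(sizeof(toy_numeric) + n * sizeof(double));
    if (v == NULL)
        return NULL;

    v->type = TOY_REAL;
    v->length = n;
    return v;  /* payload is left uninitialized; the caller fills it */
}
```

A caller would fill v->data[0] through v->data[n - 1] directly and release the
block with free(v). Memory obtained this way is plain C memory that R knows
nothing about, so a real replacement would still have to produce something R
accepts as a vector.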

>    2.  For creating STRSXP should I be using mkChar() or mkString() to create
>         element values ? Is there a way to do it without allocating a cons cell ?
>         (otherwise a single STRSXP with 1e6 length slows down garbage collector)
A string vector (STRSXP) is composed of CHARSXPs. mkChar makes a
CHARSXP, which is what you want for each element. mkString makes a
length-1 STRSXP holding one CHARSXP; it is shorthand for allocating
that vector and then calling

SET_STRING_ELT(str, 0, mkChar("foo"));

>    3.   Is "row.names" attribute required for data frames and, if so, can I
>         use some other type besides STRSXP ?
It is required. As of R 2.4.0 it can be an integer vector instead of a
STRSXP.

>    4.   While poking around to find out why some of my code is excessively slow
>         I have come upon definition of `[.data.frame` - subscription operator
>         for data frames, which appears to be written in R. I am wondering whether
>         I am looking at the right place and whether anyone would be interested in
>         a piece of C code optimizing it - in particular extraction of single element
>         is quite slow (i.e. calls like T[i, j]).
[.data.frame is such a pain to implement because there are just too
many ways to index a data frame. You may want to write a specialized
indexer that handles only the indexing styles you actually use. But I
think you are just not vectorizing enough. If you have to access your
data frames element by element like that, then it must be inside some
loop, which would kill your social life.

Or, you may just use (and pour your effort into improving) SQLiteDF.

M. Manese
