[Rd] Some R questions
vdergachev at rcgardis.com
Wed Nov 1 19:40:39 CET 2006
On Tuesday 31 October 2006 9:30 pm, miguel manese wrote:
> Had experience with this on doing SQLiteDF...
> On 11/1/06, Vladimir Dergachev <vdergachev at rcgardis.com> wrote:
> > Hi all,
> > I am working with some large data sets (1-4 GB) and have some
> > questions that I hope someone can help me with:
> >         1. Is there a way to turn off the garbage collector from within the
> > C interface? What I am trying to do is pull data from MySQL (using my own
> > C functions), and I see that allocating each column (with about 1-4
> > million items) takes between 0.5 and 1 seconds. My first thought was that
> > it would be nice to turn off the garbage collector, allocate all the data,
> > copy the values, and then turn the garbage collector back on.
> I believe not. FWIW a numeric() vector is a chunk of memory with a
> VECTOR_SEXPREC header and then your data allocated contiguously. If you
> are desperate enough, and assuming the garbage collector is indeed the
> culprit, you may want to implement your own lightweight allocVector
> (the function that NEW_NUMERIC(), etc. expand to).
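The contiguous layout described above can be sketched with a mock struct (the
names here are illustrative stand-ins, not R's actual VECTOR_SEXPREC
definition):

```c
#include <stdlib.h>

/* Illustrative mock of the layout described above: a header followed
   immediately by the data, allocated as one contiguous chunk.  The
   field names are stand-ins, not R's real internals. */
typedef struct {
    int type;          /* plays the role of the SEXP header */
    long length;
    double data[];     /* contiguous payload, as in a REALSXP */
} mock_vec;

static mock_vec *mock_alloc_numeric(long n)
{
    mock_vec *v = malloc(sizeof(mock_vec) + n * sizeof(double));
    if (v) {
        v->type = 14;  /* REALSXP's type code */
        v->length = n;
    }
    return v;
}
```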
   Thank you very much for the suggestion! After looking around in the code
I realized that what I really wanted was R_gc_internal() - that way I can tell
the garbage collector in advance that I will require that much heap, so that
it does not need to go and grow it each time I ask (btw I would have
expected it to double the heap each time it runs out, but this is not
what happens, at least in R 2.3.1).
   After some mucking around, here is a poor man's substitute:

void fault_mem_region(long size)
{
    long chunk = size;
    int max = (1 << 30) / sizeof(int);  /* cap a single request */
    if (chunk > max) chunk = max;
    R_gc_internal(chunk);               /* pre-grow the heap before allocating */
}
   On a 48-column data frame (with 1.2e6 rows) the call
fault_mem_region(ncol+nrow*11+ncol*nrow) shaved 5 seconds off a 33-second
running time (which includes running the mysql query).
   It is not perfect, however, as I could see the last columns allocating more
slowly than the initial ones.
   Also, while looking around in allocVector I saw that after running the
garbage collector it simply calls malloc, and if malloc fails it runs the
garbage collector again and retries.
   What would be nice is the ability to bypass that first garbage collector
call when allocating large nodes.
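The ordering asked for here can be sketched as follows (alloc_large() and the
gc callback are hypothetical names for illustration, not part of R's API):

```c
#include <stdlib.h>

/* Sketch of the proposed order for large nodes: try malloc first and
   only run the collector when malloc fails, instead of the current
   gc-then-malloc order.  alloc_large() and run_gc are hypothetical. */
static void *alloc_large(size_t bytes, void (*run_gc)(void))
{
    void *p = malloc(bytes);
    if (p == NULL && run_gc != NULL) {
        run_gc();           /* collect only on failure... */
        p = malloc(bytes);  /* ...then retry once */
    }
    return p;
}
```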
> > 2. For creating STRSXP should I be using mkChar() or mkString() to
> > create element values ? Is there a way to do it without allocating a cons
> > cell ? (otherwise a single STRSXP with 1e6 length slows down garbage
> > collector)
> A string vector (STRSXP) is composed of CHARSXP's. mkChar makes a
> CHARSXP, and mkString makes a STRSXP with 1 CHARSXP, more like a
> shorthand for
> SEXP str = NEW_CHARACTER(1);
> SET_STRING_ELT(str, 0, mkChar("foo"));
Makes sense - thank you !
> > 3. Is "row.names" attribute required for data frames and, if so, can
> > I use some other type besides STRSXP ?
> It is required. It can be integers, for 2.4.0+
> > 4. While poking around to find out why some of my code is
> > excessively slow I have come upon definition of `[.data.frame` -
> > subscription operator for data frames, which appears to be written in R.
> > I am wondering whether I am looking at the right place and whether anyone
> > would be interested in a piece of C code optimizing it - in particular
> > extraction of single element is quite slow (i.e. calls like T[i, j]).
> [.data.frame is such a pain to implement because there are just too
> many ways to index a data frame. You may want to do a specialized
> indexer that just considers the indexing styles you use. But I think
> you are not vectorizing enough. If you have to access your data
> frames like that then it must be inside some loop, which would kill
> your social life.
   Hmm, I was thinking of implementing subscripting with integer or logical
vectors, plus some hash-based lookup for column and (possibly) row names.
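A minimal sketch of the hash-based name lookup (a toy open-addressing table;
fixed size, no resizing or deletion, and the function names are made up for
illustration):

```c
#include <string.h>

/* Toy open-addressing hash table mapping column names to indices. */
#define TBL 64                           /* power of two */
static const char *keys[TBL];
static int         vals[TBL];

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;                   /* djb2 string hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static void put_col(const char *name, int index)
{
    unsigned i = hash_name(name) & (TBL - 1);
    while (keys[i] && strcmp(keys[i], name))
        i = (i + 1) & (TBL - 1);         /* linear probing */
    keys[i] = name;
    vals[i] = index;
}

static int get_col(const char *name)     /* -1 if absent */
{
    unsigned i = hash_name(name) & (TBL - 1);
    while (keys[i]) {
        if (!strcmp(keys[i], name)) return vals[i];
        i = (i + 1) & (TBL - 1);
    }
    return -1;
}
```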
   The slowness manifests itself for vectorized code as well. I believe it is
due to the code mucking about with the row.names attribute, which imposes a
penalty on any [,] operation - a penalty that grows linearly with the number
of rows. Thus for large data frames A[,1] is slower than A[]. For example,
for the data frame I mentioned above, E<-A[] took 0.46 seconds (way too much
in my opinion), but E<-A[,1] took 62.45 seconds - more than a minute, and more
than twice the time it took to load the entire thing into memory. Silly,
isn't it?
Also, there are good reasons to want to address individual cells. And there is
no reason why such access cannot be constant time.
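To illustrate why single-cell access can be constant time: if the frame is
just an array of column pointers, fetching T[i, j] is two dereferences with no
attribute handling at all (the struct below is a sketch, not data.frame's real
representation):

```c
/* Illustrative column-store frame: cell access is O(1), independent
   of nrow - no row.names work is involved, unlike `[.data.frame`. */
typedef struct {
    int nrow, ncol;
    double **cols;        /* cols[j] points at column j's values */
} toy_frame;

static double get_cell(const toy_frame *f, int i, int j)
{
    return f->cols[j][i]; /* two dereferences, constant time */
}
```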
> Or, you may just use (and pour your effort on improving) SQLiteDF
   Very nice! The documentation mentioned something about the assignment
operator not working - is this still true? Or maybe I misunderstood something?
   Also, I wonder whether it would be possible to extend the [[ operator so
one can run queries: SQLDF[["SELECT * FROM a WHERE.."]]
thank you very much !
> M. Manese