[Rd] Some R questions

Wed Nov 1 19:40:39 CET 2006

On Tuesday 31 October 2006 9:30 pm, miguel manese wrote:
> Hi,
>
> Had experience with this on doing SQLiteDF...
>
> On 11/1/06, Vladimir Dergachev <vdergachev at rcgardis.com> wrote:
> > Hi all,
> >
> >    I am working with some large data sets (1-4 GB) and have some
> > questions that I hope someone can help me with:
> >
> >    1.  Is there a way to turn off the garbage collector from within the
> > C interface ? What I am trying to do is pull data from MySQL (using my
> > own C functions), and I see that allocating each column (with about 1-4
> > million items) takes between 0.5 and 1 second. My first thought was that
> > it would be nice to turn off the garbage collector, allocate all the
> > data, copy the values and then turn the garbage collector back on.
>
> I believe not. FWIW a numeric() vector is a chunk of memory with a
> VECTOR_SEXPREC header and then your data contiguously allocated. If you
> are desperate enough and assuming the garbage collector is indeed the
> culprit, you may want to implement your own lightweight allocVector
> (the function that NEW_NUMERIC(), etc. expand to).

Thank you very much for the suggestion ! After looking around in the code
I realized that what I really wanted was R_gc_internal(), as then I can tell
the garbage collector in advance how much heap I will require, so that it
does not need to grow the heap each time I ask (btw I would have expected it
to double the heap each time it runs out, but this is not what happens, at
least in R 2.3.1).
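
To illustrate why I expected doubling, here is a minimal sketch in plain C
that counts growth steps under two policies. It does not reproduce what R
2.3.1 actually does; the function names are hypothetical, sizes are in
arbitrary units, and it assumes the target is well below SIZE_MAX/2 so the
doubling cannot overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Number of growth steps needed to reach `target` units when the heap
   starts at `start` units and doubles every time it runs out. */
int grow_steps_doubling(size_t start, size_t target)
{
    int steps = 0;
    size_t cap = start;

    assert(start > 0);
    while (cap < target) {
        cap *= 2;
        steps++;
    }
    return steps;
}

/* Number of growth steps when the heap instead grows by a fixed
   `increment` every time it runs out. */
int grow_steps_fixed(size_t start, size_t target, size_t increment)
{
    int steps = 0;
    size_t cap = start;

    assert(increment > 0);
    while (cap < target) {
        cap += increment;
        steps++;
    }
    return steps;
}
```

Growing from 1 unit to 1024 units takes 10 steps with doubling and 1023
steps with an increment of 1, which is why fixed-increment growth gets
expensive when every step can also trigger a collection.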

After some mucking around, here is a poor man's substitute which might be 
useful:

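#include <R.h>
#include <Rinternals.h>

/* Allocate `size` ints in chunks of at most 1 GB, hold all of them
   protected, then release them together, so that R grows its heap once
   up front instead of on every later column allocation. */
void fault_mem_region(long size)
{
	const long max = (1L << 30) / sizeof(int);	/* ints per 1 GB chunk */
	int block_count = 0;

	while (size > 0) {
		long chunk = size > max ? max : size;
		PROTECT(allocVector(INTSXP, chunk));
		block_count++;
		size -= chunk;
	}
	UNPROTECT(block_count);
}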

On a 48-column data frame (with 1.2e6 rows) the call 
fault_mem_region(ncol+nrow*11+ncol*nrow) shaved 5 seconds off a 33-second 
running time (which includes running the MySQL query).

It is not perfect, however, as I could see the last columns allocating more 
slowly than the initial ones. 

Also, while looking around in allocVector I saw that after running the 
garbage collector it simply calls malloc, and if malloc fails it calls the 
garbage collector again.

What would be nice is the ability to bypass that first garbage collector call 
when allocating large nodes.

>
> >    2.  For creating STRSXP should I be using mkChar() or mkString() to
> > create element values ? Is there a way to do it without allocating a cons
> > cell ? (otherwise a single STRSXP with 1e6 length slows down garbage
> > collector)
>
> A string vector (STRSXP) is composed of CHARSXPs. mkChar makes a
> CHARSXP, and mkString makes a STRSXP with 1 CHARSXP, more like a
> shorthand for
>
> SEXP str = NEW_CHARACTER(1);
> SET_STRING_ELT(str, 0, mkChar("foo"));

Makes sense - thank you !

>
> >    3.   Is "row.names" attribute required for data frames and, if so, can
> > I use some other type besides STRSXP ?
>
> It is required. It can be integers, for 2.4.0+
>

Great !

> >    4.   While poking around to find out why some of my code is
> > excessively slow I came upon the definition of `[.data.frame`, the
> > subscript operator for data frames, which appears to be written in R.
> > I am wondering whether I am looking at the right place and whether anyone
> > would be interested in a piece of C code optimizing it - in particular,
> > extraction of a single element is quite slow (i.e. calls like T[i, j]).
>
> [.data.frame is such a pain to implement because there are just too
> many ways to index a data frame. You may want to write a specialized
> indexer that considers only the indexing styles you use. But I think
> you are just not vectorizing enough. If you have to access your data
> frames like that then it must be inside some loop, which would kill
> your social life.

Hmm, I thought to implement subscripting with integer or logical vectors and 
then some hash-based lookup for column and (possibly) row names.
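
As a rough idea of the hash-based name lookup I have in mind, here is a
minimal sketch in plain C. Everything in it is hypothetical (the name_index
type, the function names, the fixed table size); it uses open addressing
with linear probing and an FNV-1a hash, stores borrowed pointers to the
names, and is not taken from the R sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_SLOTS 64  /* power of two; keep it larger than the name count */

/* Hypothetical name -> column index table (open addressing). */
typedef struct {
    const char *key[NAME_SLOTS];  /* borrowed pointers; NULL marks empty */
    int value[NAME_SLOTS];        /* zero-based column index */
} name_index;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
uint32_t hash_name(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char) *s++;
        h *= 16777619u;
    }
    return h;
}

/* Mark every slot as empty. */
void name_index_init(name_index *ix)
{
    assert(ix != NULL);
    for (size_t i = 0; i < NAME_SLOTS; i++)
        ix->key[i] = NULL;
}

/* Insert or update a name; returns 0 on success, -1 if the table is full. */
int name_index_put(name_index *ix, const char *name, int column)
{
    size_t start = hash_name(name) & (NAME_SLOTS - 1);
    for (size_t probe = 0; probe < NAME_SLOTS; probe++) {
        size_t slot = (start + probe) & (NAME_SLOTS - 1);
        if (ix->key[slot] == NULL || strcmp(ix->key[slot], name) == 0) {
            ix->key[slot] = name;
            ix->value[slot] = column;
            return 0;
        }
    }
    return -1;
}

/* Look up a name; returns its column index, or -1 if it is absent. */
int name_index_get(const name_index *ix, const char *name)
{
    size_t start = hash_name(name) & (NAME_SLOTS - 1);
    for (size_t probe = 0; probe < NAME_SLOTS; probe++) {
        size_t slot = (start + probe) & (NAME_SLOTS - 1);
        if (ix->key[slot] == NULL)
            return -1;
        if (strcmp(ix->key[slot], name) == 0)
            return ix->value[slot];
    }
    return -1;
}
```

With a table like this built once per data frame, resolving a column name
costs one hash plus a few string comparisons on average, independent of the
number of columns. A real version would have to resize and to own or
protect its key strings.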

The slowness manifests itself in vectorized code as well. I believe it is due 
to the code mucking about with the row.names attribute, which introduces a 
penalty on any [,] operation - a penalty that grows linearly with the number 
of rows. 

Thus for large data frames A[,1] is slower than A[[1]]. For example, for the 
data frame I mentioned above, E<-A[[1]] took 0.46 seconds (way too much in my 
opinion), but E<-A[,1] took 62.45 seconds - more than a minute, and more than 
twice the time it took to load the entire thing into memory. Silly, isn't 
it ?

Also, there are good reasons to want to address individual cells. And there is 
no reason why such access cannot be constant time.
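
To make the constant-time point concrete, here is a minimal sketch in plain
C of a column-major table of doubles where reading or writing one cell is a
single index computation. The cell_table type and its functions are
hypothetical, indices are zero-based (unlike R), and this is not how R
stores data frames; it only shows that the cost need not depend on the
number of rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical column-major table: cell (i, j) lives at data[j * nrow + i]. */
typedef struct {
    size_t nrow, ncol;
    double *data;
} cell_table;

/* Allocate storage; returns 0 on success, -1 on overflow or failure. */
int cell_table_init(cell_table *t, size_t nrow, size_t ncol)
{
    assert(t != NULL && nrow > 0 && ncol > 0);
    if (ncol > SIZE_MAX / nrow)
        return -1;  /* nrow * ncol would overflow */
    t->data = calloc(nrow * ncol, sizeof *t->data);
    if (t->data == NULL)
        return -1;
    t->nrow = nrow;
    t->ncol = ncol;
    return 0;
}

/* Read one cell: one multiply, one add, one load. */
double cell_get(const cell_table *t, size_t i, size_t j)
{
    assert(i < t->nrow && j < t->ncol);
    return t->data[j * t->nrow + i];
}

/* Write one cell with the same constant cost. */
void cell_set(cell_table *t, size_t i, size_t j, double v)
{
    assert(i < t->nrow && j < t->ncol);
    t->data[j * t->nrow + i] = v;
}

/* Release the storage. */
void cell_table_free(cell_table *t)
{
    free(t->data);
    t->data = NULL;
}
```

Nothing in cell_get touches row names or copies a column, so the same
access costs the same for 10 rows or 1.2e6 rows.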

>
> <pimp-my-project>
> Or, you may just use (and pour your effort on improving) SQLiteDF
> http://cran.r-project.org/src/contrib/Descriptions/SQLiteDF.html
> </pimp-my-project>

Very nice ! The documentation mentioned something about the assignment 
operator not working - is this still true ? Or, maybe, I misunderstood 
something ?

Also, I wonder whether it would be possible to extend the [[ operator so one 
can run queries: SQLDF[["SELECT * FROM a WHERE.."]]

                           thank you very much !

                                     Vladimir Dergachev

>
> M. Manese