[Rd] data frame subset patch, take 2

Vladimir Dergachev vdergachev at rcgardis.com
Wed Dec 13 19:03:21 CET 2006


On Wednesday 13 December 2006 6:01 am, Martin Maechler wrote:
>
> - Vladimir, have you verified your 'take2' against recent versions
>   of R-devel?

Yes. 

>
> - If they still work, could you re-post them to R-devel, this
>   time using a proper MIME type,
>   i.e. most probably one of
>      application/x-tar
>      application/x-compressed-tar
>      application/x-gzip
>
>   In case you don't know how to achieve this,
>   I'd be interested to get it by "private" e-mail.

No problem. The old e-mail did have a mime type: "text/x-diff".
I am resending the patch - now compressed, hopefully it will get pass whatever 
filters are in place.

With regard to speedups in R, here is my wish list - I would greatly 
appreciate comments on what makes sense here or not, etc:

    1. I greatly miss equivalents of Tcl append and lappend commands - not
the function performed by these commands but efficiency (they are O(1) on 
average). Tcl easily handles lists with 1e6 components and strings of 10s of 
megabytes in length.

    2. It would be nice to have true hashed arrays in R (i.e. O(1) access 
times). So far I have used named lists for this, but they are O(n):

> L<-list(); system.time(for(i in 1:10000)L[[paste(i)]]<-i);
[1] 2.864 0.004 2.868 0.000 0.000
> L<-list(); system.time(for(i in 1:20000)L[[paste(i)]]<-i);
[1] 11.789  0.216 12.004  0.000  0.000

    3. Efficient manipulation of large numbers of strings. The big reason 
character row.names are slow is because they require a large number of string 
objects that slow down garbage collector.

    This is possibly not a problem that has an easy solution, here are a 
couple of approaches I have considered:

        a) Inline strings - use a structure like
		union {
			struct { 
				unsigned char size;
				char body[15];
				} inlined_string;  /* use this when size<16 */

			struct {
				unsigned char flag;  
				char reserved[7]; /* for 64 bit */
				CHRSXP ptr;
				} indirect_string; /* use this when flag=255 */
			}

             This basically turns small strings into an enum type stored 
within a 128-bit integer. This would greatly decrease required number of 
CHRSXP in many common cases (in particular for many rownames). 

             The biggest disadvantage is more complicated access to string 
data. Also this does not solve the issue of how to deal with 1e6 long 
strings - though I feel like 15 characters should be good enough for most 
uses.

	b) CHRSXPs are always leaf nodes. One could implement true reference counting
and create a separate garbage collector pool for them. This way one can rely 
on reference counting to free string objects during normal operation, but 
also keep track of the number of referenced strings during garbage collector 
passes - and trigger string garbage collection passes (with a warning) when 
the number of referenced strings is much smaller the number of objects in 
string pool.

            This gets rid of overhead that strings impose on garbage 
collector. The disadvantage are very large changes to R code.

                            best

                                Vladimir Dergachev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: subset.patch.2.diff.gz
Type: application/x-gzip
Size: 1474 bytes
Desc: not available
Url : https://stat.ethz.ch/pipermail/r-devel/attachments/20061213/33598b1d/attachment.gz 


More information about the R-devel mailing list