[R] Fast set intersection

Stavros Macrakis macrakis at alum.mit.edu
Sat Jan 17 22:06:26 CET 2009


I was wondering if there were any R packages supporting fast set intersection.

I am aware of base::intersection, bit::`&`, and set::set_intersection,
each of which has its advantages.  base::intersection is
space-efficient and accepts arbitrary unsorted lists of arbitrary base
types.  set::set_intersection is very fast but requires space and time
proportional to the size of the universe, a contiguous range of
integers.  set::set_intersection is nicely structured but very slow.
For example:

     q<-sample(1e7,2e6); r<-sample(1e7,1e4)
     system.time(intersect(q,r))  => 70 mS
     qb<-as.bit.which(q,1e7); qr<- as.bit.which(r,1e7);
system.time(qb&qr) => 3 mS
     system.time(set_intersection(as.set(q),as.set(r))) => many minutes
           (with |q|=2e3, time is 3 S)

I was wondering if there were other set packages which use different
representations of sets (e.g. sorted lists, heaps, etc.) to make some
set operations more efficient. In particular, I'd be interested in an
intersection operator which is fast and space-efficient when the sets
are sparse and quite different in size, such as binary merge or
Baeza-Yates (2004).

Along the same lines, I wonder if there is a version of 'match' which
takes a sorted 'table' argument and uses binary or interpolation
search?

Thanks,

           -s




More information about the R-help mailing list