[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Brian D Ripley ripley at stats.ox.ac.uk
Fri Jul 28 09:06:54 CEST 2006


Which version of R are you looking at?   R-devel has

    o	merge() works more efficiently when there are relatively few
	matches between the data frames (for example, for 1-1
	matching).  The order of the result is changed for 'sort = FALSE'.


On Thu, 27 Jul 2006, Kevin B. Hendricks wrote:

> Hi Developers,
>
> I am looking for another new project to help me get more up to speed
> on R and to learn something outside of R internals.   One recent R
> issue I have run into is finding a fast implementation of the
> equivalent of the following SAS code:
>
> /* MDPC is an integer sort key made from two integer columns */
> MDPC = (MD * 100000) + PCO;
>
> /* sort the dataset by the key */
> PROC SORT;
>    BY MDPC;
>
> /* print out count and sum for each unique sort key (subgroup) */
> /* use of BY MDPC requires sorting that data set by MDPC first in SAS */
> PROC UNIVARIATE NOPRINT;
>     VAR MVE;
>     BY MDPC;
>     OUTPUT OUT=TMP0 N=XN SUM=XSUM;
>
> Easy to do in R, but the data set this is being run on has 1,742,201
> lines and takes up 196,868,713 bytes to store as character data.  The
> sort key easily has over 200,000 unique values (if not twice that).
>
> My first R attempt was a simple
>
> # sort the data.frame gd and the sort key
> sorder <- order(MDPC)
> gd  <- gd[sorder,]
> MDPC <- MDPC[sorder]
> attach(gd)
>
> # find the length and sum for each unique sort key
> XN <- by(MVE, MDPC, length)
> XSUM <- by(MVE, MDPC, sum)
> GRPS <- levels(as.factor(MDPC))
>
> Well, the ordering and sorting were reasonably fast, but the first
> "by" statement was still running 4 hours later on my machine (a dual
> 2.6 GHz Opteron with 4 GB of main memory).  The same snippet of code
> in SAS takes about 5 minutes of system time on a slower machine.
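
The grouped count and sum above can be done without any explicit loop,
and without sorting the data frame first.  A sketch, using made-up toy
data in place of the real gd columns: rowsum() does the per-group
summation in compiled code, and tabulate() the per-group counts.

```r
## toy stand-ins for the real MDPC/MVE columns (made-up data)
set.seed(1)
MDPC <- sample(1:5, 20, replace = TRUE)
MVE  <- runif(20)

g    <- factor(MDPC)     # one level per unique sort key
XSUM <- rowsum(MVE, g)   # per-group sums, computed in C in one pass
XN   <- tabulate(g)      # per-group counts
GRPS <- levels(g)
```

The factor levels take the role of the unique keys, so no prior sort
of the data is required.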
>
> I tried various simple R implementations of a "by_sorted" that I
> thought might help
>
> # walk sorted array once and keep track of beginning
> # and ending points for each unique sort key value in p
> # and run function fcn on that sub sequence in vector v
> # store the results in a vector
>
> by_sorted <- function(v, p, fcn) {
>     key <- p[[1]]
>     bp <- 1
>     r <- NULL
>     for (i in 2:length(p)) {
>        if (key != p[[i]]) {
>            # note the parentheses: bp:i-1 would mean (bp:i)-1
>            r <- c(r, fcn(v[bp:(i - 1)]))
>            bp <- i
>            key <- p[[i]]
>        }
>     }
>     c(r, fcn(v[bp:length(p)]))
> }
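
The single pass over a sorted key that by_sorted aims for is available
in R itself via rle(), which finds the run boundaries in compiled
code; cumsum() of the run lengths then gives each block's endpoints.
A sketch (by_sorted2 is a hypothetical name):

```r
## run-length encoding of the (sorted) key gives each block's bounds
by_sorted2 <- function(v, p, fcn) {
    r   <- rle(p)                 # p is assumed sorted
    end <- cumsum(r$lengths)      # last index of each run
    beg <- end - r$lengths + 1    # first index of each run
    mapply(function(b, e) fcn(v[b:e]), beg, end)
}

p <- c(1, 1, 2, 2, 2, 5)
v <- c(10, 20, 1, 2, 3, 7)
by_sorted2(v, p, sum)    # -> 30 6 7
```

Only the per-block fcn calls happen at the R level; the boundary
search itself is linear in the number of rows and loop-free.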
>
> but they also took "forever" to run (read: I killed those attempts
> after 15 minutes of CPU time).
>
> I had exactly the same issue when trying "tapply".
>
> So unless it already exists someplace, I need a really fast
> implementation of "by" for very large sorted data sets (probably
> written in Fortran or C) that does the equivalent of what SAS does
> with its "proc univariate by" approach, with close to the same
> performance.  The code should only have to walk the array once
> (i.e. be linear in time in the number of rows of the vector).  I
> have a similar issue with "merge", since merging data frames already
> sorted on the same sort key should also be fast, and it does not
> appear to be.
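
For the 1-1 matching case, much of what a sorted merge buys can
already be had from match(), which joins keys via hashing in roughly
linear time, with no pairwise comparison.  A sketch with made-up
frames and column names:

```r
## two small frames sharing a key column (made-up data)
a <- data.frame(key = c(1, 3, 5), x = c("a", "b", "c"))
b <- data.frame(key = c(3, 5, 7), y = c(30, 50, 70))

i <- match(a$key, b$key)   # row of b matching each row of a (NA if none)
cbind(a, y = b$y[i])       # left join; NA where a key has no match
```

This covers only unique keys on the right-hand side; duplicate keys on
both sides still need the full merge() machinery.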
>
> Before I jump into this and create my own "sorted large data set"
> versions of "by" and "merge", I wanted to be sure it would be of
> interest to others.  If they work well and are well implemented (a
> big if, since I am really still just learning this - the whole point
> of the project!), would something like this be of any interest for
> internal use in R?  Or is this something too specialized?
> Is there an R function implemented in C or Fortran that would make a
> good "model" to follow for implementing something like this?
> Would/should they be extensions of the current implementations of
> "merge" and "by", with the addition of a sorted=TRUE (defaulting to
> FALSE) extra parameter?
>
> Or am I simply barking up the wrong tree here?
>
> Thanks,
>
> Kevin
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


