[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Kevin B. Hendricks kevin.hendricks at sympatico.ca
Mon Jul 31 23:19:52 CEST 2006


Hi Thomas,

Here is a comparison of performance times for my own igroupSums  
versus split/lapply and rowsum:

 > x <- rnorm(2e6)
 > i <- rep(1:1e6,2)
 >
 > unix.time(suma <- unlist(lapply(split(x,i),sum)))
[1] 8.188 0.076 8.263 0.000 0.000
 >
 > names(suma)<- NULL
 >
 > unix.time(sumb <- igroupSums(x,i))
[1] 0.036 0.000 0.035 0.000 0.000
 >
 > all.equal(suma, sumb)
[1] TRUE
 >
 > unix.time(sumc <- rowsum(x,i))
[1] 0.744 0.000 0.742 0.000 0.000
 >
 > sumc <- sumc[,1]
 > names(sumc)<-NULL
 > all.equal(suma,sumc)
[1] TRUE


So my implementation of igroupSums is faster and already handles NAs.  
I have also implemented igroupMins, igroupMaxs, igroupAnys,  
igroupAlls, igroupCounts, igroupMeans, and igroupRanges.

The igroup functions I implemented do not handle weights yet but do  
handle NAs properly.
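
For concreteness, here is a minimal R sketch of the idea behind  
igroupSums (the real implementation is compiled code and is not shown  
in this message; the sketch's name, arguments, and NA handling are  
illustrative assumptions only): map the group codes to the integers  
1..n, then accumulate in a single pass.

    ## Illustrative sketch only; not the actual igroupSums.
    igroupSums_sketch <- function(x, i, na.rm = TRUE) {
        f <- factor(i)
        g <- as.integer(f)            # map group codes to 1..nlevels(f)
        if (na.rm) {
            keep <- !is.na(x)
            x <- x[keep]
            g <- g[keep]
        }
        out <- numeric(nlevels(f))
        for (k in seq_along(x))       # single pass over the data
            out[g[k]] <- out[g[k]] + x[k]
        names(out) <- levels(f)
        out
    }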

Assuming I clean them up, is anyone in the R developer group interested?

Or would you rather I instead extend the rowsum approach to create  
rowcount, rowmax, rowmin, etc., using a hash-function approach?

All of these approaches simply use different ways to map group  
codes to integers and then apply the functions in the same way.
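
To make that concrete, here is a small illustrative sketch (the  
helper names are mine, purely hypothetical): once the group codes are  
mapped to 1..n, whether by factor coding, by a hash table as in  
rowsum, or by run boundaries when the input is already sorted, the  
same accumulation loop serves for sums, maxima, counts, and so on.

    ## Illustrative only: the mapping step varies, the reduction does not.
    group_reduce <- function(x, i, init, f) {
        u <- unique(i)
        g <- match(i, u)              # one possible code-to-integer mapping
        out <- rep(init, length(u))
        for (k in seq_along(x))
            out[g[k]] <- f(out[g[k]], x[k])
        names(out) <- as.character(u)
        out
    }

    ## e.g. a rowmax analogue under these assumptions:
    rowmax_sketch <- function(x, i) group_reduce(x, i, -Inf, max)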

Thanks,

Kevin


