[Rd] Any interest in "merge" and "by" implementations specifically for so

Mon Jul 31 15:41:53 CEST 2006

Hi Tom,

> Now, try sorting and using a loop:
>
>> idx <- order(i)
>> xs <- x[idx]
>> is <- i[idx]
>> res <- array(NA, 1e6)
>> idx <- which(diff(is) > 0)
>> startidx <- c(1, idx+1)
>> endidx <- c(idx, length(xs))
>> f1 <- function(x, startidx, endidx, FUN = sum)  {
> +   for (j in 1:length(res)) {
> +     res[j] <- FUN(x[startidx[j]:endidx[j]])
> +   }
> +   res
> + }
>> unix.time(res1 <- f1(xs, startidx, endidx))
> [1] 6.86 0.00 7.04   NA   NA

I wonder how much time the sorting, reordering, and creation of
startidx and endidx would add to this time?
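
One rough way to check is to time the preparation step on its own. The sketch below repeats the quoted preparation code inside system.time(). The data sizes, the seed, and the name brk (used so that idx is not overwritten) are made up for illustration and are not from the thread.

```r
# Hypothetical test data: sizes chosen only for illustration.
set.seed(1)
n <- 2e5
i <- sample.int(2e4, n, replace = TRUE)  # group ids
x <- runif(n)                            # values to aggregate

# Time only the preparation: ordering, reordering, and building the
# start/end positions of each run of equal group ids.
prep_time <- system.time({
  idx <- order(i)
  xs <- x[idx]
  is <- i[idx]
  brk <- which(diff(is) > 0)      # last position of every group but the final one
  startidx <- c(1, brk + 1)
  endidx <- c(brk, length(xs))
})
print(prep_time)
```

Comparing prep_time with the loop timing above would show whether the setup cost matters; the actual numbers will depend on the machine and the data.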

Either way, your code is a nice way to quickly create the small
integer factors I would need if the igroup functions get integrated.
Thanks!
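
As a minimal sketch of that idea, the sorted ids can be turned into compact group numbers 1, 2, ..., k with diff() and cumsum(). The example ids and the names gs and g are hypothetical, and this makes no claim about the interface the igroup functions would actually expect.

```r
i <- c(1007L, 42L, 1007L, 500000L, 42L, 42L)  # made-up sparse group ids

idx <- order(i)
is <- i[idx]

# A new group starts wherever the sorted id changes; cumsum() over
# those flags numbers the groups 1, 2, ..., k in sorted order.
gs <- cumsum(c(TRUE, diff(is) > 0))

# Map the compact ids back to the original (unsorted) positions.
g <- integer(length(i))
g[idx] <- gs
print(g)
```

The result should agree with match(i, sort(unique(i))), which is a simpler but possibly slower way to get the same small integers.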

> For the case of sum (or averages), you can vectorize this using
> cumsum as follows. This won't work for median or max.
>
>> f2 <- function(x, startidx, endidx)  {
> +   cum <- cumsum(x)
> +   res <- cum[endidx]
> +   res[2:length(res)] <- res[2:length(res)] - cum[endidx[1:(length(res) - 1)]]
> +   res
> + }
>> unix.time(res2 <- f2(xs, startidx, endidx))
> [1] 0.20 0.00 0.21   NA   NA

Yes, that is quite a fast way to handle sums.
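
For the averages mentioned in the quoted text, the same cumsum differences can be divided by the group lengths. This is only a sketch on a tiny made-up vector; the names sums and means are illustrative, and the subtraction is written slightly differently from the quoted f2 so that it also works when there is a single group.

```r
xs <- c(1, 2, 3, 10, 20, 100)  # values, already sorted by group
startidx <- c(1, 4, 6)         # first position of each group
endidx   <- c(3, 5, 6)         # last position of each group

cum <- cumsum(xs)

# Group sum = cumulative sum at the group's end minus the cumulative
# sum at the previous group's end (0 for the first group).
sums <- cum[endidx] - c(0, cum[endidx[-length(endidx)]])

# Group mean = group sum divided by the number of elements in the group.
means <- sums / (endidx - startidx + 1)
print(sums)
print(means)
```

One caveat: with long floating-point vectors, differences of large cumulative sums can lose some precision compared with summing each group directly.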

> You can also use Luke Tierney's byte compiler
> (http://www.stat.uiowa.edu/~luke/R/compiler/) to speed up the loop for
> functions where you can't vectorize:
>
>> library(compiler)
>> f3 <- cmpfun(f1)
> Note: local functions used: FUN
>> unix.time(res3 <- f3(xs, startidx, endidx))
> [1] 3.84 0.00 3.91   NA   NA

That looks interesting.  Does it only work for specific operating  
systems and processors?  I will give it a try.

Thanks,

Kevin