[R] SLOW split() function

Tue Oct 11 03:01:02 CEST 2011

dear R experts:  apologies for all my speed and memory questions.  I
have a bet with my coauthors that I can make R reasonably efficient
through R-appropriate programming techniques.  this is not just for
kicks, but for work.  for benchmarking, my [3 year old] Mac Pro has
2.8GHz Xeons, 16GB of RAM, and R 2.13.1.

right now, it seems that 'split()' is why I am losing my bet.  (split
is an integral component of *apply() and by(), so I need split() to be
fast.  its resulting list can then be fed, e.g., to mclapply().)  I
made up an example to illustrate my ills:

    library(data.table)
    N <- 1000
    T <- N*10
    d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
    setkey(d, "key"); gc() ## force a garbage collection
    cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
    print(system.time( s<-split(d, d$key) ))

My ordered input data table (or data frame; doesn't make a difference)
is 114MB in size.  it takes about a second to create.  split() only
needs to reshape it.  this simple operation takes almost 5 minutes on
my computer.

with a data set that is larger, this explodes further.

am I doing something wrong?  is there an alternative to split()?

sincerely,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)