[R] Thoughts for faster indexing

Thu Nov 21 21:41:52 CET 2013

Not sure this helps but...

######
# data frame with 30,000 IDs, each with 5 "dates", plus some random data...
df <- data.frame(id=rep(1:30000, each=5),
                 date=rep(1:5, times=30000),   # times=, so every ID gets dates 1..5
                 x=rnorm(150000), y=rnorm(150000, mean=1), z=rnorm(150000, mean=3))
library(data.table)
dt <- data.table(df, key="id")    # built from df; note you have to set the key...

# No difference when using which
system.time(for (i in 1:300) {j <- which(df$id==i)})
  user  system elapsed
  0.73    0.06    0.79

system.time(for (i in 1:300) {j <- which(dt$id==i)})
  user  system elapsed
  0.69    0.04    0.76

# About 20x faster using a keyed join, dt[J(i)], than data.frame subsetting
system.time(for (i in 1:300) {select <- df[df$id==i,]})
  user  system elapsed
  19.25    0.36   19.64 
system.time(for (i in 1:300) {select <- dt[id==i,]})
  user  system elapsed
  4.32    0.11    4.45 
system.time(for (i in 1:300) {select <- dt[J(i)]})
  user  system elapsed
  0.88    0.00    0.88
######

Note that extracting the rows with a data.table join (0.88 s) still took slightly longer than generating an "index" with which
(0.79 s), but getting all the columns in one step, rather than just the index, might speed up later operations.
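
Another option, offered as a sketch and not timed here: since the slow step is re-scanning the id column for every person,
the row indices for all IDs can be computed once with split() from base R and then looked up by name inside the loop. The
data frame d and its columns below are made-up stand-ins for the real data, and the loop body is only a placeholder.

```r
# Hypothetical example data: 30,000 IDs with 5 rows each (names are illustrative).
set.seed(1)
d <- data.frame(id   = rep(1:30000, each = 5),
                date = rep(1:5, times = 30000),
                x    = rnorm(150000))

# One pass over the id column: a named list mapping each ID to its row numbers.
idx <- split(seq_len(nrow(d)), d$id)

# Inside the loop, a list lookup by name replaces which(d$id == person).
for (person in 1:300) {
  j <- idx[[as.character(person)]]
  # ... process d[j, ] here ...
}
```

For any given ID, idx[[as.character(person)]] holds the same row numbers that which(d$id == person) would return, so the rest
of the processing should not need to change. Whether this beats the keyed join on the real data would have to be measured.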

-----Original Message-----
From: Noah Silverman [mailto:noahsilverman at g.ucla.edu] 
Sent: Wednesday, November 20, 2013 3:17 PM
To: 'R-help at r-project.org'
Subject: [R] Thoughts for faster indexing

Hello,

I have a fairly large data.frame.  (About 150,000 rows of 100
variables.) There are case IDs, and multiple entries for each ID, with a date stamp.  (i.e. records of people's activity.)

I need to iterate over each person (record ID) in the data set, and then process their data for each date.  The processing part is
fast, the date part is fast.  Locating the records is slow.  I've even tried using data.table, with ID set as the index, and it is
still slow.

The line with the slow process (According to Rprof) is:

j <- which( d$id == person )

(I then process all the records indexed by j, which seems fast enough.)

where d is my data.frame or data.table

I thought that using the data.table indexing would speed things up, but not in this case.

Any ideas on how to speed this up?

Thanks!

--
Noah Silverman, M.S., C.Phil
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095