[R] Faster Subsetting

Dénes Tóth toth.denes at ttk.mta.hu
Thu Sep 29 00:55:32 CEST 2016


Hi Harold,

Generally: you cannot beat data.table unless you can represent your 
data as a matrix (or array or vector). For some specific cases, Hervé's 
suggestion might also be competitive.
Your problem is that you have not put any effort into reading at least 
part of the very extensive documentation of the data.table package. You 
should start here: https://github.com/Rdatatable/data.table/wiki/Getting-started

In a nutshell: use a key, which allows binary search instead of the 
much slower vector scan. (With the auto-indexing feature of the 
data.table package, you may even skip this step.) The point is that the 
key has to be created only once; all subsequent subsetting operations 
that use the key then become incredibly fast. You missed this point 
because in one of your examples you replicated the creation of the key 
as well, not only the subsetting.
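
Here is a minimal sketch of the difference (timings omitted, since 
they depend on your machine):

library(data.table)
tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
idList <- unique(tmp$id)

## wrong benchmark: the key is rebuilt on every replication, so the
## one-time cost of sorting the table is paid 500 times
system.time(replicate(500, {
  DT <- as.data.table(tmp)
  setkey(DT, id)
  DT[.(idList[1])]
}))

## right benchmark: create the key once, then subset repeatedly
DT <- as.data.table(tmp)
setkey(DT, id)
system.time(replicate(500, DT[.(idList[1])]))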

Here is a version of Hervé's example (OK, it is a bit biased, because 
data.table has a highly optimized internal version of mean() for 
calculating group means):

## create a keyed data.table
tmp_dt <- data.table(id = rep(1:20000, each = 10), foo = rnorm(200000), 
key = "id")
system.time(tmp_dt[, .(result = mean(foo)), by = id])
# user system elapsed
# 0.004 0.000 0.005
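
If you want to check whether this internal optimization (called 
GForce) kicks in, ask for verbose output (the exact messages vary 
across versions):

## report the query optimizations data.table applies (look for GForce)
tmp_dt[, .(result = mean(foo)), by = id, verbose = TRUE]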

## subset a keyed data.table
all_ids <- tmp_dt[, unique(id)]
select_id <- sample(all_ids, 1)
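## .(select_id) is shorthand for list(select_id); on a keyed table
## this performs a binary-search join on the key column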
system.time(tmp_dt[.(select_id)])
# user system elapsed
# 0.000 0.000 0.001

## or equivalently
system.time(tmp_dt[id == select_id])
# user system elapsed
# 0.000 0.000 0.001
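
If you need many of these subsets, as in your real workflow, you can 
fetch them all in a single keyed join and even do the per-group work 
inside it. A sketch, reusing the objects above:

## subset many ids in one keyed join instead of looping over them
some_ids <- sample(all_ids, 100)
system.time(tmp_dt[.(some_ids)])

## or run the per-id computation directly, grouped by each join value
tmp_dt[.(some_ids), .(result = mean(foo)), by = .EACHI]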

Note: the CRAN version of the data.table package is already very fast, 
but you should try the development version 
(devtools::install_github("Rdatatable/data.table")) for multi-threaded 
subsetting.
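
Recent versions also let you query and limit the number of threads; a 
quick sketch (getDTthreads() and setDTthreads() are part of the 
data.table API, though not in very old releases):

getDTthreads()    # number of threads data.table will use
setDTthreads(4)   # limit it, e.g. on a shared machine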


Cheers,
Denes


On 09/28/2016 08:53 PM, Hervé Pagès wrote:
 > Hi,
 >
 > I'm surprised nobody suggested split(). Splitting the data.frame
 > upfront is faster than repeatedly subsetting it:
 >
 >    tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
 >    idList <- unique(tmp$id)
 >
 >    system.time(for (i in idList) tmp[which(tmp$id == i),])
 >    #   user  system elapsed
 >    # 16.286   0.000  16.305
 >
 >    system.time(split(tmp, tmp$id))
 >    #   user  system elapsed
 >    #  5.637   0.004   5.647
 >
 > Cheers,
 > H.
 >
 > On 09/28/2016 09:09 AM, Doran, Harold wrote:
 >> I have an extremely large data frame (~13 million rows) that resembles
 >> the structure of the object tmp below in the reproducible code. In my
 >> real data, the variable, 'id' may or may not be ordered, but I think
 >> that is irrelevant.
 >>
 >> I have a process that requires subsetting the data by id and then
 >> running each smaller data frame through a set of functions. One
 >> example below uses indexing and the other uses an explicit call to
 >> subset(), both return the same result, but indexing is faster.
 >>
 >> Problem is in my real data, indexing must parse through millions of
 >> rows to evaluate the condition and this is expensive and a bottleneck
 >> in my code.  I'm curious if anyone can recommend an improvement that
 >> would somehow be less expensive and faster?
 >>
 >> Thank you
 >> Harold
 >>
 >>
 >> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
 >>
 >> idList <- unique(tmp$id)
 >>
 >> ### Fast, but not fast enough
 >> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
 >>
 >> ### Not fast at all, a big bottleneck
 >> system.time(replicate(500, subset(tmp, id == idList[1])))
 >>
 >


