[R] Faster Subsetting

Wed Sep 28 20:30:14 CEST 2016

On Wed, 28 Sep 2016, "Doran, Harold" <HDoran at air.org> writes:

> I have an extremely large data frame (~13 million rows) that resembles
> the structure of the object tmp below in the reproducible code. In my
> real data, the variable, 'id' may or may not be ordered, but I think
> that is irrelevant.
>
> I have a process that requires subsetting the data by id and then
> running each smaller data frame through a set of functions. One
> example below uses indexing and the other uses an explicit call to
> subset(), both return the same result, but indexing is faster.
>
> Problem is in my real data, indexing must parse through millions of
> rows to evaluate the condition and this is expensive and a bottleneck
> in my code.  I'm curious if anyone can recommend an improvement that
> would somehow be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>

If you really need only one column, it will be faster
to extract that column and then to take a subset of it:

  system.time(replicate(500, tmp[[2L]][tmp$id == idList[1L]]))

(A data.frame is a list of equal-length columns, usually
 atomic vectors, and it is typically faster to first
 extract the component of interest, i.e. the specific
 column, and then to subset this vector. The result will,
 of course, be a vector, not a data.frame.)
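
A minimal sketch of that comparison, reusing tmp and idList
from the example above. The seed and the names via_df and
via_vec are illustrative additions, not part of the original
code, and the timings will depend on the machine and on the
size of the data:

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

## subset the rows of the data.frame, then take the column
via_df  <- tmp[tmp$id == idList[1L], ][["foo"]]

## extract the column first, then subset the plain vector
via_vec <- tmp[["foo"]][tmp$id == idList[1L]]

## both routes select the same values ...
identical(via_df, via_vec)

## ... but the second never builds an intermediate data.frame
is.data.frame(via_vec)

## compare the cost of the two approaches
system.time(replicate(500, tmp[tmp$id == idList[1L], ]))
system.time(replicate(500, tmp[["foo"]][tmp$id == idList[1L]]))
```

Note that the condition tmp$id == idList[1L] is still
evaluated over all rows in both cases; what the second
version avoids is the overhead of constructing a
data.frame for each subset.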

-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net