[R] Faster Subsetting

ruipbarradas at sapo.pt ruipbarradas at sapo.pt
Wed Sep 28 18:57:15 CEST 2016


Hello,

If you work with a matrix instead of a data.frame, subsetting usually
runs faster, but a matrix can hold only one type, so all of your
columns must be numeric for this to work.
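
To illustrate that caveat, here is a small sketch with made-up data
(df_num, df_mix and the grp column are hypothetical names, not part of
the original example). When any column is non-numeric, as.matrix()
returns a character matrix, so the numeric comparison used below would
no longer be comparing numbers.

```r
# Hypothetical data: one all-numeric data frame, one with a character column
df_num <- data.frame(id = c(1, 2, 3), foo = c(0.5, 1.5, 2.5))
df_mix <- data.frame(id = c(1, 2, 3), grp = c("a", "b", "c"),
                     stringsAsFactors = FALSE)

m_num <- as.matrix(df_num)   # all columns numeric: the matrix stays numeric
m_mix <- as.matrix(df_mix)   # mixed columns: every value becomes a string

typeof(m_num)
typeof(m_mix)
```

So it is worth checking is.numeric() on the matrix before relying on it.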

> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
    user  system elapsed
    0.05    0.00    0.04
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
    user  system elapsed
    0.07    0.00    0.08
>

# Make it a matrix and use the matrix
> mattmp <- as.matrix(tmp)
> system.time(replicate(500, mattmp[which(mattmp[,"id"] == idList[1]),]))
    user  system elapsed
    0.01    0.00    0.01
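
If every id has to be visited in turn, another option, not timed above
and offered only as a sketch, is to group the row indices once with
split() and then look each id up by name, so the id column is not
re-scanned for each id. It uses tmp and idList as defined in the
original post; rowsById and one are names made up for this example.

```r
set.seed(1)  # only to make the example reproducible
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

# Group the row indices by id in a single pass over the id column
rowsById <- split(seq_len(nrow(tmp)), tmp$id)

# Fetch the rows for one id by name instead of evaluating id == value
one <- tmp[rowsById[[as.character(idList[1])]], ]

# Check that this selects the same rows as the indexing approach
identical(one, tmp[which(tmp$id == idList[1]), ])
```

Whether this helps on 13 million rows would need to be timed on the
real data.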


Hope this helps,

Rui Barradas




Quoting Doran, Harold <HDoran at air.org>:

> I have an extremely large data frame (~13 million rows) that
> resembles the structure of the object tmp below in the reproducible
> code. In my real data, the variable 'id' may or may not be ordered,
> but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then
> running each smaller data frame through a set of functions. One
> example below uses indexing and the other uses an explicit call to
> subset(). Both return the same result, but indexing is faster.
>
> The problem is that, in my real data, indexing must parse through
> millions of rows to evaluate the condition, and this is expensive and
> a bottleneck in my code. I'm curious whether anyone can recommend an
> improvement that would be less expensive and faster.
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
