[R] Fastest way to repeatedly subset a data frame?

Iestyn Lewis ilewis at pharm.emory.edu
Fri Apr 20 23:01:39 CEST 2007


This is fantastic.  I just tested the first match() method and it is 
acceptably fast.  I'll look into some of the even better methods 
later.   Thank you for taking the time to put this together.

Is this kind of optimization information available anywhere on the web?  I 
imagine a lot of people have slow sets of commands that could be 
sped up with this kind of knowledge. 

Thank you so much,

Iestyn
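
P.S. for anyone reading this in the archives: the match() method I tested 
boils down to something like the sketch below.  The names (df, ids) just 
mirror Tony's example further down; the point is that match() does one 
hashed, exact-match pass over the id column instead of the partial-match 
row-name search that df[ids, ] triggers.

```r
n  <- 10000
df <- data.frame(id = paste("ID", seq_len(n), sep = ""),
                 result = seq_len(n),
                 stringsAsFactors = FALSE)

# exact hashed lookup via match(), then plain integer subscripting
ids <- c("ID3", "ID7", "ID500")
sub <- df[match(ids, df$id), , drop = FALSE]
```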

Tony Plate wrote:
> Here are some timings on seemingly minor variations of data structure, 
> with results ranging by a factor of 100 (a factor of 3 if the worst 
> case is omitted).  One of the keys is to avoid the partial string 
> matching that happens with ordinary data frame subscripting.
>
> -- Tony Plate
>
> > n <- 10000 # number of rows in data frame
> > k <- 500   # number of vectors in indexing list
> > # use a data frame with regular row names and id as factor (defaults 
> for data.frame)
> > df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
> result=seq(len=n), stringsAsFactors=TRUE)
> > object.size(df)
> [1] 440648
> > df[1:3,,drop=FALSE]
>    id result
> 1 ID1      1
> 2 ID2      2
> 3 ID3      3
> > set.seed(1)
> > ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> > sum(sapply(ids, length))
> [1] 1263508
> > system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>    user  system elapsed
>    3.00    0.00    3.03
> >
> > # use a data frame with automatic row names (should be low overhead) 
> and id as factor
> > df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
> result=seq(len=n), row.names=NULL, stringsAsFactors=TRUE)
> > object.size(df)
> [1] 440648
> > df[1:3,,drop=FALSE]
>    id result
> 1 ID1      1
> 2 ID2      2
> 3 ID3      3
> > set.seed(1)
> > ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> > sum(sapply(ids, length))
> [1] 1263508
> > system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>    user  system elapsed
>    2.68    0.00    2.70
> >
> > # use a data frame with automatic row names (should be low overhead) 
> and id as character
> > df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
> result=seq(len=n), row.names=NULL, stringsAsFactors=FALSE)
> > object.size(df)
> [1] 400448
> > df[1:3,,drop=FALSE]
>    id result
> 1 ID1      1
> 2 ID2      2
> 3 ID3      3
> > set.seed(1)
> > ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> > sum(sapply(ids, length))
> [1] 1263508
> > system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>    user  system elapsed
>    1.54    0.00    1.59
> >
> > # use a data frame with ids as the row names & subscripting for 
> matching (should be high overhead)
> > df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
> result=seq(len=n), row.names="id")
> > object.size(df)
> [1] 400384
> > df[1:3,,drop=FALSE]
>     result
> ID1      1
> ID2      2
> ID3      3
> > set.seed(1)
> > ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> > sum(sapply(ids, length))
> [1] 1263508
> > system.time(lapply(ids, function(i) df[i,,drop=FALSE]))
>    user  system elapsed
>  109.15    0.04  111.28
> >
> > # use a data frame with ids as the row names & match()
> > df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
> result=seq(len=n), row.names="id")
> > object.size(df)
> [1] 400384
> > df[1:3,,drop=FALSE]
>     result
> ID1      1
> ID2      2
> ID3      3
> > set.seed(1)
> > ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> > sum(sapply(ids, length))
> [1] 1263508
> > system.time(lapply(ids, function(i) df[match(i, 
> rownames(df)),,drop=FALSE]))
>    user  system elapsed
>    1.53    0.00    1.58
> >
> > # use a named numeric vector to store the same data as was stored in 
> the data frame
> > x <- seq(len=n)
> > names(x) <- paste("ID", seq(len=n), sep="")
> > object.size(x)
> [1] 400104
> > x[1:3]
> ID1 ID2 ID3
>   1   2   3
> > set.seed(1)
> > ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> > sum(sapply(ids, length))
> [1] 1263508
> > system.time(lapply(ids, function(i) x[match(i, names(x))]))
>    user  system elapsed
>    1.14    0.05    1.19
> >
>
>
>
>
>
> Iestyn Lewis wrote:
>> Good tip - an Rprof trace over my real data set resulted in a file 
>> filled with:
>>
>> pmatch [.data.frame [ FUN lapply
>> pmatch [.data.frame [ FUN lapply
>> pmatch [.data.frame [ FUN lapply
>> pmatch [.data.frame [ FUN lapply
>> pmatch [.data.frame [ FUN lapply
>> ...
>> with very few other calls in there.  pmatch is the partial 
>> string-matching function, so I'm guessing there's no hashing going on, 
>> or not very good hashing.
>>
>> I'll let you know how the environment option works - the Bioconductor 
>> project seems to make extensive use of it, so I'm guessing it's the 
>> way to go.
>>
>> Iestyn
>>
>> hadley wickham wrote:
>>>> But... it's not any faster, which is worrisome to me because it seems
>>>> like your code uses rownames and would take advantage of the hashing
>>>> potential of named items.
>>> I'm pretty sure it will use a hash to access the specified rows.
>>> Before you pursue an environment based solution, you might want to
>>> profile the code to check that the hashing is actually the slowest
>>> part - I suspect creating all new data.frames is taking the most time.
>>>
>>> Hadley
>>
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
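
The Rprof trace Iestyn describes comes from R's sampling profiler, which 
Hadley suggests running before switching data structures.  A minimal sketch 
of that workflow (the file name and toy workload here are illustrative, not 
from the thread):

```r
# small data frame with character row subscripting, the slow path above
df <- data.frame(id = paste("ID", 1:2000, sep = ""), result = 1:2000)

Rprof("subset.prof")                 # start the sampling profiler
for (j in 1:500) {
  # character row-name subscripting goes through pmatch()
  sub <- df[sample(rownames(df), 50), , drop = FALSE]
}
Rprof(NULL)                          # stop profiling

# "[.data.frame" (and pmatch underneath it) should dominate this listing,
# much like the repeated lines in Iestyn's trace
top <- summaryRprof("subset.prof")$by.total
```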

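The "environment option" discussed above (the approach Bioconductor makes 
heavy use of) amounts to treating a hashed environment as an id-to-value 
dictionary.  A minimal sketch, with made-up names, assuming one scalar 
value per id:

```r
n <- 10000
e <- new.env(hash = TRUE, size = n)   # hashed environment as a dictionary
for (i in seq_len(n)) {
  assign(paste("ID", i, sep = ""), i, envir = e)
}

# exact hashed lookups -- no partial matching as with df[ids, ]
ids  <- c("ID3", "ID7", "ID500")
vals <- unlist(mget(ids, envir = e))
```

mget() retrieves all the requested ids in one call and errors by default on 
a missing key, which makes silent partial matches impossible.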