[R] Fastest way to repeatedly subset a data frame?

Tony Plate tplate at acm.org
Fri Apr 20 23:20:53 CEST 2007


This type of information about the speed of various techniques can really 
only be found out by trying things out, especially because R-core has 
recently made a fair number of improvements to some of the underlying 
code in R.  That's part of the reason I put these tests together -- I 
wanted to know for myself what sort of speed differences there now are 
among the various approaches.
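The contrast at the heart of the timings below can be sketched as follows (a minimal, self-contained illustration with made-up sizes, not the exact benchmark code): character subscripting of row names, which goes through partial matching of the kind pmatch() does, versus match() against a character id column followed by integer subscripting.

```r
# Minimal sketch: row-name subscripting vs. match() on an id column.
# Sizes and ids here are illustrative, smaller than in the timings below.
n <- 1000
df <- data.frame(id = paste("ID", seq_len(n), sep = ""),
                 result = seq_len(n),
                 row.names = paste("ID", seq_len(n), sep = ""),
                 stringsAsFactors = FALSE)

ids <- c("ID3", "ID500", "ID999")

# Subscripting by row-name strings (the slow path in the timings below)
a <- df[ids, , drop = FALSE]

# match() against the id column, then integer subscripting (faster)
b <- df[match(ids, df$id), , drop = FALSE]

# A named numeric vector was the fastest structure tried below
x <- df$result
names(x) <- df$id
v <- x[match(ids, names(x))]

identical(a, b)   # both select the same rows
```

In the timings below the row-name subscripting path was roughly 70x slower than the match()-based equivalents on the same data.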

-- Tony Plate

Iestyn Lewis wrote:
> This is fantastic.  I just tested the first match() method and it is 
> acceptably fast.  I'll look into some of the even better methods 
> later.   Thank you for taking the time to put this together.
> 
> Is this kind of optimization information on the web anywhere?  I can 
> imagine that a lot of people have slow sets of commands that could be 
> optimized with this kind of knowledge. 
> 
> Thank you so much,
> 
> Iestyn
> 
> Tony Plate wrote:
>> Here's some timings on seemingly minor variations of data structure 
>> showing timings ranging by a factor of 100 (factor of 3 if the worst 
>> is omitted).  One of the keys is to avoid use of the partial string 
>> match that happens with ordinary data frame subscripting.
>>
>> -- Tony Plate
>>
>>> n <- 10000 # number of rows in data frame
>>> k <- 500   # number of vectors in indexing list
>>> # use a data frame with regular row names and id as factor (defaults 
>> for data.frame)
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
>> result=seq(len=n), stringsAsFactors=TRUE)
>>> object.size(df)
>> [1] 440648
>>> df[1:3,,drop=FALSE]
>>    id result
>> 1 ID1      1
>> 2 ID2      2
>> 3 ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
>> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>>    user  system elapsed
>>    3.00    0.00    3.03
>>> # use a data frame with automatic row names (should be low overhead) 
>> and id as factor
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
>> result=seq(len=n), row.names=NULL, stringsAsFactors=TRUE)
>>> object.size(df)
>> [1] 440648
>>> df[1:3,,drop=FALSE]
>>    id result
>> 1 ID1      1
>> 2 ID2      2
>> 3 ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
>> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>>    user  system elapsed
>>    2.68    0.00    2.70
>>> # use a data frame with automatic row names (should be low overhead) 
>> and id as character
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
>> result=seq(len=n), row.names=NULL, stringsAsFactors=FALSE)
>>> object.size(df)
>> [1] 400448
>>> df[1:3,,drop=FALSE]
>>    id result
>> 1 ID1      1
>> 2 ID2      2
>> 3 ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
>> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>>    user  system elapsed
>>    1.54    0.00    1.59
>>> # use a data frame with ids as the row names & subscripting for 
>> matching (should be high overhead)
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
>> result=seq(len=n), row.names="id")
>>> object.size(df)
>> [1] 400384
>>> df[1:3,,drop=FALSE]
>>     result
>> ID1      1
>> ID2      2
>> ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
>> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[i,,drop=FALSE]))
>>    user  system elapsed
>>  109.15    0.04  111.28
>>> # use a data frame with ids as the row names & match()
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""), 
>> result=seq(len=n), row.names="id")
>>> object.size(df)
>> [1] 400384
>>> df[1:3,,drop=FALSE]
>>     result
>> ID1      1
>> ID2      2
>> ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
>> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i, 
>> rownames(df)),,drop=FALSE]))
>>    user  system elapsed
>>    1.53    0.00    1.58
>>> # use a named numeric vector to store the same data as was stored in 
>> the data frame
>>> x <- seq(len=n)
>>> names(x) <- paste("ID", seq(len=n), sep="")
>>> object.size(x)
>> [1] 400104
>>> x[1:3]
>> ID1 ID2 ID3
>>   1   2   3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n, 
>> size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) x[match(i, names(x))]))
>>    user  system elapsed
>>    1.14    0.05    1.19
>>
>>
>>
>>
>> Iestyn Lewis wrote:
>>> Good tip - an Rprof trace over my real data set resulted in a file 
>>> filled with:
>>>
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> ...
>>> with very few other calls in there.  pmatch seems to be the string 
>>> search function, so I'm guessing there's no hashing going on, or not 
>>> very good hashing.
>>>
>>> I'll let you know how the environment option works - the Bioconductor 
>>> project seems to make extensive use of it, so I'm guessing it's the 
>>> way to go.
>>>
>>> Iestyn
>>>
>>> hadley wickham wrote:
>>>>> But... it's not any faster, which is worrisome to me because it seems
>>>>> like your code uses rownames and would take advantage of the hashing
>>>>> potential of named items.
>>>> I'm pretty sure it will use a hash to access the specified rows.
>>>> Before you pursue an environment based solution, you might want to
>>>> profile the code to check that the hashing is actually the slowest
>>>> part - I suspect creating all new data.frames is taking the most time.
>>>>
>>>> Hadley
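Profiling a subsetting loop as Hadley suggests can be sketched like this (an illustration with made-up sizes, not code from the thread; the tryCatch guards against runs too fast for the sampler to record anything):

```r
# Sketch: profiling repeated data-frame subsetting with Rprof().
# Sizes are illustrative; a very fast run may record no samples,
# hence the tryCatch around summaryRprof().
n <- 10000
df <- data.frame(result = seq_len(n),
                 row.names = paste("ID", seq_len(n), sep = ""))
ids <- lapply(1:200, function(i) paste("ID", sample(n, 500), sep = ""))

prof_file <- tempfile()
Rprof(prof_file)
res <- lapply(ids, function(i) df[i, , drop = FALSE])
Rprof(NULL)

# by.self shows where time was spent directly, e.g. in pmatch or match
prof <- tryCatch(summaryRprof(prof_file)$by.self,
                 error = function(e) NULL)
if (!is.null(prof)) print(head(prof))
```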
>>> ______________________________________________
>>> R-help at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
> 
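The environment-based alternative Iestyn mentions can be sketched as follows (an illustration, not code from the thread, and not benchmarked against the approaches above): an environment created with new.env(hash = TRUE) stores values under names in a hash table, giving name-based lookup without any partial matching.

```r
# Sketch: a hashed environment as a lookup table keyed by id.
# Illustrative only; not benchmarked against the approaches above.
n <- 1000
e <- new.env(hash = TRUE, size = n)
for (i in seq_len(n)) {
  assign(paste("ID", i, sep = ""), i, envir = e)
}

# mget() looks up a batch of names in one call, returning a named list
ids <- c("ID3", "ID500", "ID999")
vals <- unlist(mget(ids, envir = e))
```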


