[R] How to speed up list access in R?

Thu Oct 30 22:09:10 CET 2014

Or do all the subsetting in one pass - "[" will use a hash map.

Hadley
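
A minimal sketch of the one-pass idea above, using hypothetical toy data in place of the thread's `numbers` and `values`. It compares looking up keys one at a time with `[[` against passing all keys to `[` in a single call. It only shows that the two give the same result; it makes no timing claim.

```r
# Hypothetical toy data standing in for the thread's `numbers` and `values`.
numbers <- c(3, 1, 3, 2, 1)
values  <- c(10, 20, 30, 40, 50)

# Named list keyed by number: names are "1", "2", "3".
d <- split(values, numbers)

keys <- as.character(c(3, 1))

# One lookup per key, each a separate call to `[[`.
one_at_a_time <- lapply(keys, function(k) d[[k]])

# All keys at once, in a single call to `[`.
one_pass <- unname(d[keys])

identical(one_at_a_time, one_pass)
```

With many keys, the single call avoids repeating the name search once per key, which is where the suggestion above expects the saving to come from.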

On Thu, Oct 30, 2014 at 12:05 PM, William Dunlap <wdunlap at tibco.com> wrote:
> You can try using an environment instead of a list.
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
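
A minimal sketch of the environment suggestion above, assuming hypothetical toy data in place of the thread's `numbers` and `values`. The loop mirrors the original poster's loop, with only the container changed from a list to an environment.

```r
# Hypothetical toy data standing in for the thread's `numbers` and `values`.
numbers <- c(3, 1, 3, 2, 1)
values  <- c(10, 20, 30, 40, 50)

# Environments are hashed by name and are modified in place.
d <- new.env(hash = TRUE)

for (i in seq_along(numbers)) {
  key <- as.character(numbers[i])
  # d[[key]] is NULL for a key not seen yet, and c(NULL, x) is just x.
  d[[key]] <- c(d[[key]], values[i])
}

d[["3"]]      # values collected for key "3"
sort(ls(d))   # all keys stored in the environment
```

Note that each per-key vector is still extended with `c()`, so this sketch only addresses the lookup by name, not the cost of growing the vectors.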
>
>
> On Thu, Oct 30, 2014 at 10:02 AM, Thomas Nyberg <tomnyberg at gmail.com> wrote:
>> Thanks to everyone for the help! For the moment I'll stick with Bill's
>> solution, but I'll check out the other recommendations as well.
>>
>> Regarding the issue of slow lookups for lists, are there any hash map
>> implementations in R that are faster? I like using fairly simple logic and
>> data structures when prototyping, and then only optimizing code when and
>> where it's necessary, which is why I'm curious about these basic objects.
>>
>> On another note, is there a vector-style implementation that changes the
>> vectors in place? If I'm not mistaken, the append operation creates and
>> returns a new vector each time, which is in line with the functional nature
>> of R. If there were some way to make it mutable, it could be much faster.
>> This is fairly standard in many languages: behind the scenes, memory is
>> allocated at, say, 2 times the current size, so that you only need log(n)
>> extensions when building up a vector like this. Are there any such
>> equivalents in R? I presume that lists are mutable (am I wrong?), but they
>> seem to have the lookup slowdown problem.
>>
>> Again thanks a lot!
>>
>> Cheers,
>> Thomas
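
The doubling strategy described in the question above can be approximated by hand. This is a rough sketch only: `make_buffer`, `push`, and `get` are hypothetical names, not part of base R, and whether R avoids copying the buffer on each element assignment depends on the R version, so no speed-up is claimed here.

```r
# Hypothetical helper, not part of base R: a growable numeric buffer
# kept inside a closure, with capacity doubled whenever it fills up.
make_buffer <- function(capacity = 16) {
  data <- numeric(capacity)
  n <- 0  # number of elements actually in use

  push <- function(x) {
    if (n == length(data)) {
      # Buffer is full: double the capacity (at least one extra slot).
      data <<- c(data, numeric(max(1, length(data))))
    }
    n <<- n + 1
    data[n] <<- x
    invisible(NULL)
  }

  # Return only the filled part of the buffer.
  get <- function() data[seq_len(n)]

  list(push = push, get = get)
}

buf <- make_buffer(capacity = 2)
for (x in 1:10) buf$push(x)
buf$get()
```

Starting from a capacity of 2, storing 10 elements triggers three extensions (to 4, 8, and 16 slots) instead of one per appended element.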
>>
>>
>> On 10/30/2014 12:05 PM, William Dunlap wrote:
>>>
>>> Repeatedly extending vectors takes a lot of time.  You can do what you
>>> want with
>>>    d2 <- split(values, factor(numbers, levels=unique(numbers)))
>>> If you would like the labels on d2 to be in numeric order then you can
>>> simplify that to
>>>    d3 <- split(values, numbers)
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
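
A small illustration of the two `split()` calls above, using hypothetical toy data in place of the thread's `numbers` and `values`. It shows that the two results hold the same groups and differ only in the order of their names.

```r
# Hypothetical toy data standing in for the thread's `numbers` and `values`.
numbers <- c(3, 1, 3, 2, 1)
values  <- c(10, 20, 30, 40, 50)

# Groups in order of first appearance of each number.
d2 <- split(values, factor(numbers, levels = unique(numbers)))

# Groups in sorted numeric order.
d3 <- split(values, numbers)

names(d2)
names(d3)
d2[["3"]]
```

Either result can then be queried as `d[[toString(number)]]`, as in the original question.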
>>>
>>>
>>> On Thu, Oct 30, 2014 at 8:17 AM, Thomas Nyberg <tomnyberg at gmail.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I want to do the following: Given a set of (number, value) pairs, I want
>>>> to create a list l so that l[[toString(number)]] returns the vector of
>>>> values associated with that number. My R code is hundreds of times slower
>>>> than the equivalent that I would write in Python. I'm pretty new to R, so
>>>> I bet I'm using its data structures inefficiently, but I've tried more or
>>>> less everything I can think of and can't really speed it up. I have done
>>>> some profiling, which helped me find problem areas, but I couldn't speed
>>>> things up even with that information. I'm thinking I'm just fundamentally
>>>> using R in a silly way.
>>>>
>>>> I've included code for the different versions. I wrote the Python code in
>>>> a style to make it as clear to R programmers as possible. Thanks a lot!
>>>> Any help would be greatly appreciated!
>>>>
>>>> Cheers,
>>>> Thomas
>>>>
>>>>
>>>> R code (with two versions depending on commenting):
>>>>
>>>> -----
>>>>
>>>> numbers <- numeric(0)
>>>> for (i in 1:5) {
>>>>      numbers <- c(numbers, sample(1:30000, 10000))
>>>> }
>>>>
>>>> values <- numeric(0)
>>>> for (i in 1:length(numbers)) {
>>>>      values <- append(values, sample(1:10, 1))
>>>> }
>>>>
>>>> starttime <- Sys.time()
>>>>
>>>> d = list()
>>>> for (i in 1:length(numbers)) {
>>>>      number = toString(numbers[i])
>>>>      value = values[i]
>>>>      if (is.null(d[[number]])) {
>>>>      #if (!(number %in% names(d))) {
>>>>          d[[number]] <- c(value)
>>>>      } else {
>>>>          d[[number]] <- append(d[[number]], value)
>>>>      }
>>>> }
>>>>
>>>> endtime <- Sys.time()
>>>>
>>>> print(format(endtime - starttime))
>>>>
>>>> -----
>>>>
>>>> uncommented version: "45.64791 secs"
>>>> commented version: "1.423056 mins"
>>>>
>>>>
>>>>
>>>> Another version of R code:
>>>>
>>>> -----
>>>>
>>>> numbers <- numeric(0)
>>>> for (i in 1:5) {
>>>>      numbers <- c(numbers, sample(1:30000, 10000))
>>>> }
>>>>
>>>> values <- numeric(0)
>>>> for (i in 1:length(numbers)) {
>>>>      values <- append(values, sample(1:10, 1))
>>>> }
>>>>
>>>> starttime <- Sys.time()
>>>>
>>>> d = list()
>>>> for (number in unique(numbers)) {
>>>>      d[[toString(number)]] <- numeric(0)
>>>> }
>>>> for (i in 1:length(numbers)) {
>>>>      number = toString(numbers[i])
>>>>      value = values[i]
>>>>      d[[number]] <- append(d[[number]], value)
>>>> }
>>>>
>>>> endtime <- Sys.time()
>>>>
>>>> print(format(endtime - starttime))
>>>>
>>>> -----
>>>>
>>>> "47.15579 secs"
>>>>
>>>>
>>>>
>>>> The Python code:
>>>>
>>>> -----
>>>>
>>>> import random
>>>> import time
>>>>
>>>> numbers = []
>>>> for i in range(5):
>>>>      numbers += random.sample(range(30000), 10000)
>>>>
>>>> values = []
>>>> for i in range(len(numbers)):
>>>>      values.append(random.randint(1, 10))
>>>>
>>>> starttime = time.time()
>>>>
>>>> d = {}
>>>> for i in range(len(numbers)):
>>>>      number = numbers[i]
>>>>      value = values[i]
>>>>      if d.has_key(number):
>>>>          d[number].append(value)
>>>>      else:
>>>>          d[number] = [value]
>>>>
>>>> endtime = time.time()
>>>>
>>>> print endtime - starttime, "seconds"
>>>>
>>>> -----
>>>>
>>>> 0.123021125793 seconds
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.

-- 
http://had.co.nz/