[R] How to speed up list access in R?

Thomas Nyberg tomnyberg at gmail.com
Thu Oct 30 18:14:20 CET 2014


(CCing the list because I definitely think it would be useful to have 
this in the archives.)

Thanks so much! It's clear that I'm going to need to read this email 
plus references quite a few times before things sink in. I guess I'll 
have to change my point of view entirely for this language. I understand 
much of the theory behind functional languages and have made much use of 
vectorization in programming in the past, but I tended to only do that 
when absolutely necessary to avoid complexity. I guess I really will 
have to fundamentally change my programming style and turn that theory 
into practice.

Thanks again everyone!

Cheers,
Thomas

On 10/30/2014 12:58 PM, Dennis Murphy wrote:
> Hi:
>
> You're making several fundamental mistakes that people who move to R
> from a conventional programming language often make. The ones that are
> immediately apparent are:
>
> (i) Initializing empty objects.
> (ii) Populating the object one element at a time.
> (iii) An over-reliance on for loops to do the job.
>
> One of the most important features of R is the concept of
> vectorization, which basically means "program the entire object, not
> its pieces". Item (ii) usually guarantees slow processing because R
> uses lazy evaluation, which means that in cases like yours, the memory
> used in each iteration is accumulated until the loop is complete,
> which in turn means that iterations will get progressively longer as R
> has to search harder for memory. One quick fix to (ii) is to
> pre-allocate the required space, but the better fix is to use
> vectorization whenever possible, because then the actual looping takes
> place at the compiler level rather than inside the interpreter - the
> former is generally much faster.
>
> If I read your code properly, you may have meant to do something like this:
>
> numbers <- sample(seq(30000), 50000, replace = TRUE)
> values <- sample(seq(10), 50000, replace = TRUE)
> L <- split(numbers, values)
> sapply(L, length)    # returns the length of each of the ten
> subvectors in the list
>
> The time it takes to create the list on my system:
>
>> system.time({
> + numbers <- sample(seq(30000), 50000, replace = TRUE)
> + values <- sample(seq(10), 50000, replace = TRUE)
> + L <- split(numbers, values)  })
>     user  system elapsed
>        0       0       0
>
> The length of the numeric vectors associated with each value:
>
> sapply(L, length)
>     1    2    3    4    5    6    7    8    9   10
> 4976 4968 4912 5111 5009 5074 4987 4950 5166 4847
>
> Notice the following features of my code:
>
> * no object initialization
> * no loops
>
>
> More comments inline.
>
> On Thu, Oct 30, 2014 at 8:17 AM, Thomas Nyberg <tomnyberg at gmail.com> wrote:
>> Hello,
>>
>> I want to do the following: Given a set of (number, value) pairs, I want to
>> create a list l so that l[[toString(number)]] returns the vector of values
>> associated to that number. It is hundreds of times slower than the
>> equivalent that I would write in python. I'm pretty new to R so I bet I'm
>> using its data structures inefficiently, but I've tried more or less
>> everything I can think of and can't really speed it up. I have done some
>> profiling which helped me find problem areas, but I couldn't speed things up
>> even with that information. I'm thinking I'm just fundamentally using R in a
>> silly way.
>>
>> I've included code for the different versions. I wrote the python code in a
>> style to make it as clear to R programmers as possible. Thanks a lot! Any
>> help would be greatly appreciated!
>>
>> Cheers,
>> Thomas
>>
>>
>> R code (with two versions depending on commenting):
>>
>> -----
>>
>> numbers <- numeric(0)
>> for (i in 1:5) {
>>      numbers <- c(numbers, sample(1:30000, 10000))
>> }
>>
>> values <- numeric(0)
>> for (i in 1:length(numbers)) {
>>      values <- append(values, sample(1:10, 1))
>> }
>
> The two loops above are classic cases of initializing empty vectors
> and populating them one element at a time. This may be efficient in
> Python, C or Java, but it is not so in R because R uses lazy
> evaluation, which means that an expression is not evaluated until its
> values are needed. In this case, they are needed at the end of the
> loop.
>
> When you have the time, download Patrick Burns' classic R Inferno,
> which covers this problem as well as many other "gotchas" that bite
> the inexperienced R programmer.
>>
>>             starttime <- Sys.time()
>>
>> d = list()
>> for (i in 1:length(numbers)) {
>>      number = toString(numbers[i])
>>      value = values[i]
>>      if (is.null(d[[number]])) {
>>      #if (number %in% names(d)) {
>>          d[[number]] <- c(value)
>>      } else {
>>          d[[number]] <- append(d[[number]], value)
>>      }
>> }
>
> (1) Conversion to string is unnecessary.
> (2) Use of append() is unnecessary. As Bill mentioned, split()
> performs the same purpose as your code.
>
> One use of list objects in R allows for its components to have the
> same type but different lengths.
>
> The takeaway: efficient programming in Python does not necessarily
> translate to efficient programming in R. You need to think somewhat
> differently in R. My approach is to think in terms of entire objects
> rather than its elements and program accordingly, which comes in handy
> as I can do
>
> mean(x)      # find the arithmetic average of the numeric vector x
> median(x)   # find the sample median of the numeric vector x
>
> etc., which are vectorized functions. Trying to program these in a
> standard programming language requires a lot more code and as a result
> the code is clunkier and less transparent to an external reader.
>
> Dennis
>
>>
>> endtime <- Sys.time()
>>
>> print(format(endtime - starttime))
>>
>> -----
>>
>> uncommented version: "45.64791 secs"
>> commented version: "1.423056 mins"
>>
>>
>>
>> Another version of R code:
>>
>> -----
>>
>> numbers <- numeric(0)
>> for (i in 1:5) {
>>      numbers <- c(numbers, sample(1:30000, 10000))
>> }
>>
>> values <- numeric(0)
>> for (i in 1:length(numbers)) {
>>      values <- append(values, sample(1:10, 1))
>> }
>>
>> starttime <- Sys.time()
>>
>> d = list()
>> for (number in unique(numbers)) {
>>      d[[toString(number)]] <- numeric(0)
>> }
>> for (i in 1:length(numbers)) {
>>      number = toString(numbers[i])
>>      value = values[i]
>>      d[[number]] <- append(d[[number]], value)
>> }
>>
>> endtime <- Sys.time()
>>
>> print(format(endtime - starttime))
>>
>> -----
>>
>> "47.15579 secs"
>>
>>
>>
>> The python code:
>>
>> -----
>>
>> import random
>> import time
>>
>> numbers = []
>> for i in range(5):
>>      numbers += random.sample(range(30000), 10000)
>>
>> values = []
>> for i in range(len(numbers)):
>>      values.append(random.randint(1, 10))
>>
>> starttime = time.time()
>>
>> d = {}
>> for i in range(len(numbers)):
>>      number = numbers[i]
>>      value = values[i]
>>      if d.has_key(number):
>>          d[number].append(value)
>>      else:
>>          d[number] = [value]
>>
>> endtime = time.time()
>>
>> print endtime - starttime, "seconds"
>>
>> -----
>>
>> 0.123021125793 seconds
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list