[R] complexity of operations in R

Jan van der Laan rhelp at eoos.dds.nl
Fri Jul 20 14:19:42 CEST 2012


See below for the complete mail to which I reply, which was not sent to R-help.

> ==========
>
> emptyexpandlist2<-list(ne=0,l=array(NA, dim=c(1, 1000L)),len=1000L)
>
> addexpandlist2<-function(x,prev){
>   if(prev$len==prev$ne){
>     n2<-prev$len*2
>     prev <- list(ne=prev$ne, l=array(prev$l, dim=c(1, n2)), len=n2)
>   }
>   prev$ne<-prev$ne+1
>   prev$l[prev$ne]<-x
>   return(prev)
> }
>
> compressexpandlist2<-function(prev){
>   return(prev$l[seq.int(prev$ne)])
> }
>
> h3<-function(dotot){
>   v<-emptyexpandlist2
>   for(i in 1:dotot){
>     v<-addexpandlist2(FALSE,v)
>   }
>   return(compressexpandlist2(v))
> }
>
> =======

The problem with your addexpandlist2 is that R essentially uses
pass-by-value semantics (certainly when you modify the objects you pass
into your function, as you do with prev). Therefore, when you pass your
list to addexpandlist2, R makes a copy of the entire list.
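
If your R build has memory profiling enabled, you can watch this copying
happen with tracemem(). A small illustrative snippet (not part of the
original code; the names are made up):

big <- list(l = numeric(1e6))
tracemem(big$l)             # mark the vector; R reports when it is duplicated
touch <- function(prev) {
    prev$l[1] <- 0          # modifying the argument triggers a copy
    prev
}
big <- touch(big)           # tracemem() prints a duplication message here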

You can avoid that by using environments, which are passed by reference.
The code below shows an example of this. If you would like to implement
something like that, I would recommend using reference classes (see
?ReferenceClasses). Personally, I don't find the 'messy code' that messy;
you get used to it.

myvector <- function(N = 1000) {
     data <- vector("list", N)   # pre-allocated buffer
     n    <- 0                   # number of elements actually in use

     append <- function(d) {
         n <<- n + 1
         if (n > N) {            # buffer full: double its size
             N <<- 2*N
             length(data) <<- N
         }
         data[[n]] <<- d
     }

     length <- function() {
         return(n)
     }

     get <- function() {
         return(data[seq_len(n)])
     }

     # data, n and N live in this function's environment; append, length and
     # get read and modify them in place via <<-, so nothing is copied
     return(list(append=append, length=length, get=get))
}


h4 <- function(dotot){
     v <- myvector()
     for(i in seq_len(dotot)) {
         v$append(FALSE)
     }
     return(v$get())
}


> system.time(h3(1E5))
    user  system elapsed
  22.846   0.536  23.407
> system.time(h4(1E5))
    user  system elapsed
   0.700   0.000   0.702
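
For completeness, roughly the same idea written as a reference class could
look like the sketch below. The class, field and function names (including
h5) are just made up for illustration, and this is only a lightly tested
sketch, not a drop-in implementation:

ExpandingVector <- setRefClass("ExpandingVector",
     fields  = list(data = "list", n = "integer", N = "integer"),
     methods = list(
         initialize = function(size = 1000L) {
             data <<- vector("list", size)   # pre-allocated buffer
             n    <<- 0L                     # number of elements in use
             N    <<- size                   # current buffer size
         },
         append = function(d) {
             n <<- n + 1L
             if (n > N) {                    # buffer full: double it
                 N <<- 2L * N
                 length(data) <<- N
             }
             data[[n]] <<- d
         },
         values = function() {
             data[seq_len(n)]
         }
     )
)

h5 <- function(dotot) {
     v <- ExpandingVector$new()
     for (i in seq_len(dotot)) {
         v$append(FALSE)
     }
     return(v$values())
}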


Jan




Johan Henriksson <mahogny at areta.org> wrote:

> On Thu, Jul 19, 2012 at 5:02 PM, Jan van der Laan <rhelp at eoos.dds.nl> wrote:
>
>> Johan,
>>
>> Your 'list' and 'array doubling' code can be written much more efficiently.
>>
>> The following function is faster than your g and easier to read:
>>
>> g2 <- function(dotot) {
>>   v <- list()
>>   for (i in seq_len(dotot)) {
>>     v[[i]] <- FALSE
>>   }
>> }
>>
>
> The reason for my highly convoluted code was to simulate a linked list - to
> my knowledge a list() in R is not a linked list but a vector. I was
> assuming that R would not copy the entire memory into each sublist but
> rather keep a pointer. If this had worked, it would also be possible to
> create fancier data structures like trees and heaps:
>
> http://en.wikipedia.org/wiki/Linked_list
> http://en.wikipedia.org/wiki/Heap_%28data_structure%29
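
A cons-cell style linked list can in fact be built with environments, so
that each node only keeps a reference to the rest of the list instead of
copying it. A rough, purely illustrative sketch - the function and field
names below are made up, not from this thread:

cons <- function(head, tail = NULL) {
    node <- new.env(parent = emptyenv())
    node$head <- head       # the stored value
    node$tail <- tail       # reference to the next node, or NULL
    node
}

as_vector <- function(node) {
    out <- c()
    while (!is.null(node)) {
        out <- c(out, node$head)   # simple but O(n^2); fine for illustration
        node <- node$tail
    }
    out
}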
>
>
>>
>>
>> In the following line in your array doubling function
>>
>> >      prev$l<-rbind(prev$l,matrix(ncol=1,nrow=nextsize))
>>
>> you first create a new array (the second half of your new array). Then
>> rbind creates yet another array and has to copy the contents of both into
>> it. The following routine is much faster and scales almost linearly
>> (see below):
>>
>> h2 <- function(dotot) {
>>   n <- 1000L
>>   v <- array(NA, dim=c(1, n))
>>   for(i in seq_len(dotot)) {
>>     if (i > n) {
>>       n <- 2*n
>>       v <- array(v, dim=c(1, n))
>>     }
>>     v[, i] <- TRUE
>>   }
>>   return(v[, seq_len(i)])
>> }
>>
>
>
> That's blazingly fast, thanks! I've also learned some nice optimization
> tricks, like the L suffix, seq.int, and resizing with array...
>
> Given that this works, as you see, it's rather messy to use on a day-to-day
> basis. My next goal was to hide this in a couple of convenient functions
> (emptyexpandlist and addexpandlist) so that it can be reused without
> cluttering otherwise fine code. But the overhead of the function call, and
> of the tuple, seems to kill off the major advantage. That said, I present
> these convenience functions below:
>
> ==========
>
> emptyexpandlist2<-list(ne=0,l=array(NA, dim=c(1, 1000L)),len=1000L)
>
> addexpandlist2<-function(x,prev){
>   if(prev$len==prev$ne){
>     n2<-prev$len*2
>     prev <- list(ne=prev$ne, l=array(prev$l, dim=c(1, n2)), len=n2)
>   }
>   prev$ne<-prev$ne+1
>   prev$l[prev$ne]<-x
>   return(prev)
> }
>
> compressexpandlist2<-function(prev){
>   return(prev$l[seq.int(prev$ne)])
> }
>
> h3<-function(dotot){
>   v<-emptyexpandlist2
>   for(i in 1:dotot){
>     v<-addexpandlist2(FALSE,v)
>   }
>   return(compressexpandlist2(v))
> }
>
> =======
>
> I haven't checked the scaling but I take it that it works as it should. The
> constant factor is really bad though:
>
> dotot=50000
>> system.time(f(dotot))
>    user  system elapsed
>   5.250   0.020   5.279
>> system.time(h(dotot))
>    user  system elapsed
>   2.650   0.060   2.713
>> system.time(h2(dotot))
>    user  system elapsed
>   0.140   0.000   0.148
>> system.time(h3(dotot))
>    user  system elapsed
>   2.480   0.020   2.495
>
> Still better than without the optimization though, and pretty much as
> readable.
> Moral of the story: it seems possible to write fast R, but the code won't
> look pretty.
>
> thanks for the answers!
>
>
>
>
>
>
>>
>> Storing the data column wise makes it easier to increase the size of the
>> array.
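
A tiny check (illustrative only) that the reshaping trick used in h2 keeps
the existing elements in place:

v <- array(1:3, dim = c(1, 3))
w <- array(v, dim = c(1, 6))   # grow from 1x3 to 1x6
w[, 1:3]                       # still 1 2 3; the extra cells are recycled
                               # copies and get overwritten later anyway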
>>
>> As a reference for the timing I use the following routine in which I
>> assume I know the size of the end result.
>>
>> ref <- function(dotot) {
>>   v <- array(NA, dim=c(1, dotot))
>>   for(i in seq_len(dotot)) {
>>     v[, i] <- FALSE
>>   }
>>   return(v)
>> }
>>
>>
>> Timing the different routines:
>>
>> dotot <- c(10, 100, 200, 500, 1000, 2000, 5000, 10000,
>>     20000, 50000, 100000)
>> times <- array(NA, dim=c(length(dotot), 5))
>>
>> i <- 1
>> for (n in dotot) {
>>   cat(n, "\n")
>>
>>   times[i,1] <- system.time(f(n))[3]
>>   #times[i,2] <- system.time(g(n))[3]
>>   times[i,2] <- system.time(g2(n))[3]
>>   times[i,3] <- system.time(h(n))[3]
>>   times[i,4] <- system.time(h2(n))[3]
>>   times[i,5] <- system.time(ref(n))[3]
>>
>>   i <- i + 1
>> }
>>
>>
>>         [,1]   [,2]   [,3]  [,4]  [,5]
>>  [1,]  0.000  0.000  0.000 0.001 0.000
>>  [2,]  0.001  0.000  0.001 0.000 0.000
>>  [3,]  0.001  0.000  0.002 0.000 0.000
>>  [4,]  0.003  0.002  0.007 0.002 0.001
>>  [5,]  0.009  0.006  0.013 0.002 0.003
>>  [6,]  0.031  0.020  0.032 0.006 0.004
>>  [7,]  0.181  0.099  0.098 0.016 0.010
>>  [8,]  0.722  0.370  0.272 0.032 0.020
>>  [9,]  2.897  1.502  0.766 0.066 0.044
>> [10,] 18.681 11.770  4.465 0.162 0.103
>> [11,] 77.757 57.960 17.912 0.322 0.215
>>
>>
>> The speed of the array doubling function is comparable to the function
>> where we know the size of the end result, and it scales almost linearly.
>> (The code is a bit messy, however.)
>>
>> Jan
>>
>>
>>
>>
>> On 07/17/2012 10:58 PM, Johan Henriksson wrote:
>>
>>> Thanks for the link! I should read it through. That said, I didn't find
>>> any good general solution to the problem, so here I post some attempts for
>>> general input. Maybe someone knows how to speed this up. Both my solutions
>>> are theoretically O(n) for creating a list of n elements. The function to
>>> improve is O(n^2), which should suck tremendously - but the slow execution
>>> of R probably blows up the constant factor of the smarter solutions.
>>>
>>> Array doubling comes close in speed for large lists but it would be great
>>> if it could be comparable for smaller lists. One hidden cost I see
>>> directly
>>> is that allocating a list in R is O(n), not O(1) (or close), since it
>>> always fills it with values. Is there a way around this? I guess by using
>>> C, one could just malloc() and leave the content undefined - but is there
>>> no better way?
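
One way to check the cost of allocation on your own machine (illustrative
only; the numbers will vary):

for (n in c(1e5, 1e6, 1e7)) {
    cat(n, ":", system.time(vector("list", n))[3], "seconds\n")
}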
>>>
>>> thanks,
>>> /Johan
>>>
>>>
>>> ################################
>>> # the function we wish to improve
>>>
>>> f<-function(dotot){
>>>    v<-matrix(ncol=1,nrow=0)
>>>    for(i in 1:dotot){
>>>      v<-rbind(v,FALSE)
>>>    }
>>>    return(v)
>>> }
>>>
>>> ##########################
>>> # first attempt: linked lists
>>>
>>> emptylist <- NA
>>>
>>> addtolist <- function(x,prev){
>>>    return(list(x,prev))
>>> }
>>>
>>> g<-function(dotot){
>>>    v<-emptylist
>>>    for(i in 1:dotot){
>>>      v<-addtolist(FALSE,v)
>>>    }
>>>    return(v)
>>> }
>>>
>>> ####################################
>>> # second attempt: array doubling
>>>
>>> emptyexpandlist<-list(nelem=0,l=matrix(ncol=1,nrow=0))
>>>
>>> addexpandlist<-function(x,prev){
>>>    if(nrow(prev$l)==prev$nelem){
>>>      nextsize<-max(nrow(prev$l),1)
>>>      prev$l<-rbind(prev$l,matrix(ncol=1,nrow=nextsize))
>>>    }
>>>    prev$nelem<-prev$nelem+1
>>>    prev$l[prev$nelem]<-x
>>>    return(prev)
>>> }
>>>
>>> compressexpandlist<-function(prev){
>>>    return(as.vector(prev$l[1:prev$nelem]))
>>> }
>>>
>>> h<-function(dotot){
>>>    v<-emptyexpandlist
>>>    for(i in 1:dotot){
>>>      v<-addexpandlist(FALSE,v)
>>>    }
>>>    return(compressexpandlist(v))
>>> }
>>>
>>> #########################################
>>>
>>> dotot=100000
>>> system.time(f(dotot))
>>> #system.time(g(dotot))
>>> system.time(h(dotot))
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 17, 2012 at 8:42 PM, Patrick Burns <pburns at pburns.seanet.com>
>>> wrote:
>>>
>>>  Johan,
>>>>
>>>> If you don't know 'The R Inferno', it might
>>>> help a little.  Circle 2 has an example of
>>>> how to efficiently (relatively speaking) grow
>>>> an object if you don't know the final length.
>>>>
>>>> http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
>>>>
>>>>
>>>> If you gave a simple example of how your code
>>>> looks now and what you want it to do, then you
>>>> might get some ideas of how to improve it.
>>>>
>>>>
>>>> Pat
>>>>
>>>>
>>>> On 17/07/2012 12:47, Johan Henriksson wrote:
>>>>
>>>>  Hello!
>>>>> I am optimizing my code in R and for this I need to know a bit more
>>>>> about
>>>>> the internals. It would help tremendously if someone could link me to a
>>>>> page with O()-complexities of all the operations.
>>>>>
>>>>> In this particular case, I need something like a linked list with O(1)
>>>>> insertLast/First ability. I can't preallocate a vector since I do not
>>>>> know
>>>>> the final size of the list ahead of time.
>>>>>
>>>>> The classic array-doubling trick would give me O(1) amortized time but
>>>>> it's
>>>>> a bit messy. The other alternative I see would be to recursively store
>>>>> lists (elem, (elem, (elem, (...)))), which I take also would work? But
>>>>> I'd
>>>>> rather go for a standard R solution if there is one!
>>>>>
>>>>> cheers,
>>>>> /Johan
>>>>>
>>>>>
>>>>>  --
>>>> Patrick Burns
>>>> pburns at pburns.seanet.com
>>>> twitter: @portfolioprobe
>>>> http://www.portfolioprobe.com/blog
>>>>
>>>> http://www.burns-stat.com
>>>> (home of 'Some hints for the R beginner'
>>>> and 'The R Inferno')
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>
>
> --
> --
> -----------------------------------------------------------
> Johan Henriksson, PhD
> Karolinska Institutet
> Ecobima AB - Custom solutions for life sciences
> http://www.ecobima.se <http://www.ecobima.com>  http://mahogny.areta.org
> http://www.endrov.net
>


