[R] More efficient option to append()?

Uwe Ligges ligges at statistik.tu-dortmund.de
Fri Aug 19 22:23:46 CEST 2011



On 19.08.2011 15:50, Paul Hiemstra wrote:
>   On 08/17/2011 10:53 PM, Alex Ruiz Euler wrote:
>> Dear R community,
>>
>> I have a 2 million by 2 matrix that looks like this:
>>
>> x<-sample(1:15,2000000, replace=T)
>> y<-sample(1:10*1000, 2000000, replace=T)
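>> # (note: 1:10*1000 parses as (1:10)*1000, i.e. incomes 1000, 2000, ..., 10000)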
>>        x     y
>> [1,] 10  4000
>> [2,]  3  1000
>> [3,]  3  4000
>> [4,]  8  6000
>> [5,]  2  9000
>> [6,]  3  8000
>> [7,]  2 10000
>> (...)
>>
>>
>> The first column is a population expansion factor for the number in the
>> second column (household income). I want to expand the second column
>> with the first so that I end up with a vector beginning with 10
>> observations of 4000, then 3 observations of 1000 and so on. In my mind
>> the natural approach would be to create a NULL vector and append the
>> expansions:
>>
>> myvar<-NULL
>> myvar<-append(myvar, replicate(x[1],y[1]), 1)
>>
>> for (i in 2:length(x)) {
>> myvar<-append(myvar,replicate(x[i],y[i]),sum(x[1:i])+1)
>> }
>>
>> to end up with a vector of length sum(x), which in my real database
>> corresponds to 22 million observations.
>>
>> This works fine, but only if I run it for the first, say, 1000
>> observations. If I try to perform this on all 2 million observations
>> it takes far too long to be useful (I left it running for 11 hours
>> yesterday to no avail).
>>
>>
>> I know R performs well with operations on relatively large vectors. Why
>> is this so inefficient? And what would be the smart way to do this?
>
> Hi Alex,
>
> The other reply already gave you the R way of doing this while
> avoiding the for loop (a sketch of it is repeated at the end of this
> mail). However, there is a more general reason why your for loop is
> terribly inefficient. A small set of examples:
>
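> # Naive approach: grow outputVector one element per iteration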
> largeVector = runif(10e4)
> outputVector = NULL
> system.time(for(i in 1:length(largeVector)) {


Please do teach people to use seq_along(largeVector) rather than 
1:length(largeVector) (the latter is not safe for length-0 objects).
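
A quick illustration of the pitfall:

z <- numeric(0)
1:length(z)    # 1 0: a for loop over this runs twice on an empty vector
seq_along(z)   # integer(0): a for loop over this runs zero times, as intended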

Uwe Ligges


>      outputVector = append(outputVector, largeVector[i] + 1)
> })
> #   user  system elapsed
> #  6.591   0.168   6.786
>
> The problem with this code is that outputVector keeps on growing. Each
> call to append() allocates a new, slightly longer vector and copies all
> of the existing elements into it, so the total amount of copying grows
> quadratically with the length of the output. This is what makes the
> loop so slow. Several (much) faster alternatives exist:
>
> # Pre-allocating the outputVector
> outputVector = rep(0,length(largeVector))
> system.time(for(i in 1:length(largeVector)) {
>      outputVector[i] = largeVector[i] + 1
> })
> #   user  system elapsed
> # 0.178   0.000   0.178
> # a speed-up of roughly 37x; the gap only widens for larger
> # lengths of largeVector
>
> # Using apply functions
> system.time(outputVector <- sapply(largeVector, function(x) x + 1))
> #   user  system elapsed
> #  0.124   0.000   0.125
> # Even a bit faster
>
> # Using vectorisation
> system.time(outputVector <- largeVector + 1)
> #   user  system elapsed
> #  0.000   0.000   0.001
> # Practically instant, 6780 times faster than the first example
>
> It is not always clear in advance which method is the most suitable
> or the fastest, but all of them perform much, much better than the
> naive approach of letting outputVector grow.
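>
> For completeness, a sketch of that vectorised solution to your original
> expansion problem (rep() with a 'times' vector repeats each y[i]
> exactly x[i] times; presumably this is what the other reply showed):
>
> # expand each income y[i] into x[i] copies in one vectorised call
> myvar <- rep(y, times = x)
> length(myvar) == sum(x)   # TRUE: one entry per expansion unit
>
> On your 2 million rows this should take a fraction of a second rather
> than hours.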
>
> cheers,
> Paul
>
>> Thanks in advance.
>> Alex
>>


