[R] Has R recently made performance improvements in accumulation?

Brent yhbrent at yahoo.com
Tue Jul 19 15:40:50 CEST 2016


Subtitle: or, more likely, am I benchmarking wrong?

I am new to R, but I have read that R is a hive of performance pitfalls.  A very common one is accumulating results into a typical immutable R data structure.

Exhibit A for this concern is this StackOverflow question about appending to a list in amortized O(1) time:
    https://stackoverflow.com/questions/2436688/append-an-object-to-a-list-in-r-in-amortized-constant-time-o1
The original question was asked on 2010-03-13, and responses trickled in for at least the next five years.
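
For what it is worth, the idiom the answers there converge on, as I understand it, is to pre-allocate and fill by index rather than append repeatedly; here is a minimal sketch of that idiom (the function name accumulate_prealloc is just mine, not from the question):

    # pre-allocate a list of known length, then fill by index;
    # this avoids re-allocating and copying the list on every append
    accumulate_prealloc <- function(n) {
        out <- vector("list", n)
        for (i in seq_len(n)) {
            out[[i]] <- rnorm(5)    # stand-in for a real per-iteration result
        }
        out
    }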

Before getting to my own problem, I instead offer an example from someone vastly better at R than I am, which I am confident is free of mistakes.  Consider this interesting article on efficient accumulation in R:
    http://www.win-vector.com/blog/2015/07/efficient-accumulation-in-r/

In that article's first example, the author claims that the "obvious “for-loop” solution" (accumulating into a data.frame using rbind):
    is incredibly slow.
    ...
    we pay the cost of processing n*(n+1)/2 rows of data

In other words, what looks like an O(n) algorithm is actually O(n^2).
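
The n*(n+1)/2 figure is just the sum 1 + 2 + ... + n: on iteration i, rbind ends up processing the roughly i rows accumulated so far.  A quick sanity check of that arithmetic:

    n <- 400
    sum(seq_len(n))     # total rows processed across all n rbind calls
    n * (n + 1) / 2     # closed form; same value (80200)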

Sounds awful.  But when I executed their example on my machine, I saw O(n), NOT O(n^2), behavior!

Here is the complete code that I executed:

    mkFrameForLoop <- function(nRow,nCol) {
        d <- c()
        for(i in seq_len(nRow)) {
            ri <- mkRow(nCol)
            di <- data.frame(ri, stringsAsFactors=FALSE)
            d <- rbind(d,di)
        }
        d
    }
    
    mkRow <- function(nCol) {
        x <- as.list(rnorm(nCol))
        # make row mixed types by changing first column to string
        x[[1]] <- ifelse(x[[1]]>0,'pos','neg')
        names(x) <- paste('x',seq_len(nCol),sep='.')
        x
    }
    
    t1 = Sys.time()
    x = mkFrameForLoop(100, 10)
    tail(x)
    t2 = Sys.time()
    t2 - t1
    
    t1 = Sys.time()
    x = mkFrameForLoop(200, 10)
    tail(x)
    t2 = Sys.time()
    t2 - t1
    
    t1 = Sys.time()
    x = mkFrameForLoop(400, 10)
    tail(x)
    t2 = Sys.time()
    t2 - t1

And here is what I got for the execution times:

    Time difference of 0.113005876541138 secs
    Time difference of 0.205012083053589 secs
    Time difference of 0.390022993087769 secs

That's a linear, not quadratic, increase in execution time as a function of nRow!  It is NOT what I was expecting, although it was nice to see.

Notes:
    --the functions above are copied and pasted from the article
    --the execution time measurement code is all that I added
        --yes, that time measurement code is crude
        --I am aware of R benchmarking frameworks; in fact, I first tried using some of them
        --but when I got unexpected results, I went back to the simplest code possible, which is the above
        --it turns out that I get the same results regardless of how I measure
    --each measurement doubles the number of rows (from 100 to 200 to 400), which is what should be increasing to bring out rbind's allegedly bad behavior (see the sketch just after these notes for how I would push nRow higher)
    --my machine has a Xeon E3-1245, 8 GB of RAM, running Windows 7 Pro 64-bit, using R 3.3.1 (the latest release as of today)
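
Here, as promised, is roughly how I would push the same measurement to larger nRow using system.time(); the specific sizes are arbitrary, I just keep doubling:

    # time mkFrameForLoop at a few doubling sizes; if the cost were truly
    # quadratic in nRow, each doubling should roughly quadruple the elapsed time
    for (n in c(1000, 2000, 4000)) {
        elapsed <- system.time(mkFrameForLoop(n, 10))["elapsed"]
        cat("nRow =", n, " elapsed =", elapsed, "sec\n")
    }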

Given those unexpected results, you now understand the title of my post.  So, has R's runtime recently gotten smart enough to know when it does not need to copy a data structure?  Maybe in the case above it does a quick shallow copy instead of a deep copy?  Or am I benchmarking wrong?

What motivated my investigation is that I want to store a bunch of Monte Carlo simulation results in a numeric matrix.  When I am finished with my code, I know that it will be a massive matrix (e.g. 1,000,000 or more rows).  So I was concerned about performance and went about benchmarking.  Let me know if you want to see that benchmark code.  I found that assignment into a matrix does not seem to generate a copy, even though assignment is a mutating operation and I was worried that it might.  But when I investigated deeper, such as with the above code, I got worried that I cannot trust my results.
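
To be concrete, here is a minimal sketch of the kind of pattern I mean (this is not my actual benchmark code, and the tracemem() call is just my crude way of watching for copies):

    nRow <- 1000
    nCol <- 10
    results <- matrix(0, nrow = nRow, ncol = nCol)    # pre-allocate once
    tracemem(results)    # print a message whenever R duplicates this matrix
    for (i in seq_len(nRow)) {
        results[i, ] <- rnorm(nCol)    # stand-in for one simulation's result
    }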


I look forward to any feedback that you can give.

I would especially appreciate any links that explain how you can determine what the R runtime actually does.


