[R] performance of do.call("rbind")

Sarah Goslee sarah.goslee at gmail.com
Mon Jun 27 21:33:06 CEST 2016


That's not what I said, though, and it's not necessarily true. Growing
an object within a loop _is_ a slow process, but that's not the
problem here. The problem is using data frames instead of matrices:
rbind() on data frames has to manage the class of every column at
each step, which is very costly. Converting to matrices first will
almost always be enormously faster.

Here's an expansion of the previous example I posted, in four parts:
1. do.call() with data frames - very slow - 34.317 s elapsed for 5000
data frames
2. do.call() with matrices - very fast - 0.311 s elapsed
3. pre-allocated loop filling a data frame - even slower (!) - 82.162 s
4. pre-allocated loop meant to fill a matrix - still slow - 68.009 s
(the target there is actually a data frame; see the correction after
the timings)

It matters whether the columns are converted to numeric or character,
and the time doesn't scale linearly with list length. For a particular
problem the best solution may vary greatly (and I didn't even include
packages beyond base R). In general, though, using matrices is faster
than using data frames, and do.call() is faster than a pre-allocated
loop, which in turn is much faster than growing an object piece by
piece.

Sarah

> testsize <- 5000
>
> set.seed(1234)
> testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
> testdf.list <- lapply(seq_len(testsize), function(x) testdf)
>
> system.time(r.df <- do.call("rbind", testdf.list))
   user  system elapsed
 34.280   0.009  34.317
>
> system.time({
+ testm.list <- lapply(testdf.list, as.matrix)
+ r.m <- do.call("rbind", testm.list)
+ })
   user  system elapsed
  0.310   0.000   0.311
>
> system.time({
+   l.df <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
+   for (i in seq_len(testsize)) {
+     start <- (i-1)*100 + 1
+     end <- i*100
+     l.df[start:end, ] <- testdf.list[[i]]
+   }
+ })
   user  system elapsed
 81.890   0.069  82.162
>
> system.time({
+   l.m <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))  # NB: still a data frame
+   testm.list <- lapply(testdf.list, as.matrix)
+   for (i in seq_len(testsize)) {
+     start <- (i-1)*100 + 1
+     end <- i*100
+     l.m[start:end, ] <- testm.list[[i]]
+   }
+ })
   user  system elapsed
 67.664   0.047  68.009
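
A correction to part 4: as the comment above notes, l.m is created
with data.frame(), so that loop is still filling a data frame row by
row. Here is a sketch of what part 4 was meant to be -- I haven't
re-run the timing, but with a true matrix target the loop should come
out much closer to part 2 (l.m2 is just a made-up name to keep it
distinct):

testm.list <- lapply(testdf.list, as.matrix)
l.m2 <- matrix(NA_real_, nrow = 100 * testsize, ncol = 3)  # a real matrix this time
for (i in seq_len(testsize)) {
  start <- (i - 1) * 100 + 1
  end <- i * 100
  l.m2[start:end, ] <- testm.list[[i]]
}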




On Mon, Jun 27, 2016 at 1:05 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
> Hi,
>
> Just to add my tuppence, which might not even be worth that these days...
>
> I found the following blog post from 2013, which is likely somewhat dated, but provides benchmarks for a few methods:
>
>   http://rcrastinate.blogspot.com/2013/05/the-rbinding-race-for-vs-docall-vs.html
>
> There is also a comment with a reference there to using the data.table package, which I don't use, but may be something to evaluate.
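>
> If I'm reading its help page correctly, the basic call would be as
> simple as this (untested on my end; rbindlist returns a data.table):
>
>   library(data.table)
>   data <- rbindlist(data.list)
>   data <- as.data.frame(data)  # only if a plain data.frame is required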
>
> As Bert and Sarah hinted at, there is overhead in taking the repetitive piecemeal approach.
>
> If all of your data frames have exactly the same column structure (column order, column types), it may be prudent to pre-allocate a data frame of the target total row size yourself and then "insert" each "sub" data frame into it by row indexing, as sketched below.
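>
> A minimal sketch of that idea, with made-up object names, assuming
> identical columns across all pieces (row counts may differ):
>
>   n.each <- vapply(df.list, nrow, integer(1))
>   end <- cumsum(n.each)
>   start <- end - n.each + 1
>   # replicate row 1 to pre-allocate a frame with the right column types
>   target <- df.list[[1]][rep(1, sum(n.each)), ]
>   for (i in seq_along(df.list)) {
>     target[start[i]:end[i], ] <- df.list[[i]]
>   }
>   rownames(target) <- NULL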
>
> Regards,
>
> Marc Schwartz
>
>
>> On Jun 27, 2016, at 11:54 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>>
>> Hi Bert,
>>
>> You are most likely right. I just thought that do.call("rbind", ...)
>> was somehow more clever and allocated the memory up front. My error.
>> After more searching I did find rbind.fill from plyr, which seems to
>> do the job (it computes the size of the result data.frame and
>> allocates it first).
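>>
>> For the archives, the call is simply (rbind.fill accepts a list of
>> data frames as its first argument):
>>
>>   library(plyr)
>>   data <- rbind.fill(data.list)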
>>
>> best
>>
>> On 27 June 2016 at 18:49, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>> The following might be nonsense, as I have no understanding of R
>>> internals; but ....
>>>
>>> "Growing" structures in R by iteratively adding new pieces is often
>>> warned to be inefficient when the number of iterations is large, and
>>> your rbind() invocation might fall under this rubric. If so, you might
>>> try  issuing the call say, 20 times, over 10k disjoint subsets of the
>>> list, and then rbinding up the 20 large frames.
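>>>
>>> An untested sketch of what I mean (the 10k chunk size is arbitrary):
>>>
>>>   chunk <- ceiling(seq_along(data.list) / 10000)  # ~20 chunks for 200k frames
>>>   pieces <- lapply(split(data.list, chunk), function(x) do.call("rbind", x))
>>>   data <- do.call("rbind", pieces)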
>>>
>>> Again, caveat emptor.
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Mon, Jun 27, 2016 at 8:51 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>>>> I have a list (variable name data.list) of approx 200k data.frames,
>>>> each with dim approx 100 x 3.
>>>>
>>>> The call
>>>>
>>>> data <- do.call("rbind", data.list)
>>>>
>>>> does not complete - the run time is prohibitive (I killed the R
>>>> session after 5 minutes).
>>>>
>>>> I would think that merging data.frames is a common operation. Is
>>>> there a better-performing function I could use?
>>>>
>>>> Thank you.
>>>> Witold
>>>>


