[R] performance of do.call("rbind")

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Mon Jun 27 19:00:32 CEST 2016


Describing the data frames as "approx" 100x3 makes the problem harder to solve and slower to run than it needs to be. If you want better performance, you need to know more precisely what data you are working with.

For example, if you knew that every data frame had exactly three columns named identically and exactly 100 rows, then you could preallocate the result data frame and loop through the input data copying values directly to the appropriate destination locations in the result. 
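
A minimal sketch of that idea, assuming three numeric/integer columns named x, y and z and exactly 100 rows per frame (the names, types and the small example list are made up for illustration). It preallocates one vector per column rather than a data frame, since assigning into plain vectors is cheaper than assigning into a data frame on every iteration; the data frame is built once at the end.

```r
# Hypothetical input: a list of data frames, each exactly 100 rows by
# 3 columns named x, y, z (names and types are assumptions).
set.seed(1)
n_frames <- 50
rows_per <- 100
data.list <- lapply(seq_len(n_frames), function(i) {
  data.frame(x = rnorm(rows_per), y = runif(rows_per), z = rep(i, rows_per))
})

total <- n_frames * rows_per

# Preallocate the full-length result columns once.
x <- numeric(total)
y <- numeric(total)
z <- integer(total)

# Copy each input frame into its own block of 100 destination rows.
for (i in seq_along(data.list)) {
  idx <- ((i - 1L) * rows_per + 1L):(i * rows_per)
  df <- data.list[[i]]
  x[idx] <- df$x
  y[idx] <- df$y
  z[idx] <- df$z
}

# Assemble the data frame a single time, after all copying is done.
result <- data.frame(x = x, y = y, z = z)
```

Nothing is reallocated inside the loop, so the cost grows linearly with the number of input frames.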

To the extent that you can figure out things like the union of all column names or the total number of rows prior to starting copying data, you can adapt the above approach even if the input data frames are not identical. The key is not having to restructure/reallocate your result data frame as you go. 
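
A rough sketch of the two-pass variant, assuming each column name always carries the same type wherever it appears (the example frames and the column names a, b and c are invented). The first pass only collects row counts and column names; the second pass copies data into vectors that were sized up front, leaving NA where a frame lacks a column.

```r
# Hypothetical input: frames with differing row counts and column sets.
data.list <- list(
  data.frame(a = 1:3, b = c(0.1, 0.2, 0.3)),
  data.frame(b = c(1.5, 2.5), c = c("u", "v"), stringsAsFactors = FALSE),
  data.frame(a = 10:13, c = c("w", "x", "y", "z"), stringsAsFactors = FALSE)
)

# Pass 1: gather sizes and the union of column names without copying data.
n_rows   <- vapply(data.list, nrow, integer(1))
all_cols <- unique(unlist(lapply(data.list, names)))
starts   <- cumsum(n_rows) - n_rows + 1L
total    <- sum(n_rows)

# Preallocate one NA-filled vector per column, typed after the first frame
# in which that column appears (assumes one consistent type per column).
cols <- lapply(all_cols, function(nm) {
  for (df in data.list) {
    if (nm %in% names(df)) return(rep(df[[nm]][NA_integer_], total))
  }
})
names(cols) <- all_cols

# Pass 2: copy each frame's columns into its block of destination rows.
for (i in seq_along(data.list)) {
  df  <- data.list[[i]]
  idx <- seq.int(starts[i], length.out = n_rows[i])
  for (nm in names(df)) cols[[nm]][idx] <- df[[nm]]
}

result <- as.data.frame(cols, stringsAsFactors = FALSE)
```

Factor columns would need extra care (their levels would have to be unified first), which this sketch does not attempt.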

The bind_rows function in the dplyr package can do a lot of this for you, but as a general-purpose function it may not be as fast as code you write yourself with better knowledge of your data.
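
For comparison, a short illustration of bind_rows on a list of data frames, assuming dplyr is installed (the two small frames are invented). It accepts the list directly and fills columns missing from a frame with NA.

```r
library(dplyr)

# Hypothetical input: two frames that share only column `a`.
data.list <- list(
  data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(a = 3L, c = 2.5)
)

# bind_rows takes the list itself; no do.call is needed.
combined <- bind_rows(data.list)
```
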
-- 
Sent from my phone. Please excuse my brevity.

On June 27, 2016 8:51:17 AM PDT, Witold E Wolski <wewolski at gmail.com> wrote:
>I have a list (variable name data.list) with approx 200k data.frames
>with dim(data.frame) approx 100x3.
>
>a call
>
>data <- do.call("rbind", data.list)
>
>does not complete - run time is prohibitive (I killed the rsession
>after 5 minutes).
>
>I would think that merging data.frame's is a common operation. Is
>there a better function (more performant) that I could use?
>
>Thank you.
>Witold


