[R] "Best" way to merge 300+ .5MB dataframes?

David Winsemius dwinsemius at comcast.net
Mon Aug 11 00:24:00 CEST 2014

On Aug 10, 2014, at 11:51 AM, Grant Rettke wrote:

> Good afternoon,
> Today I was working on a practice problem. It was simple, and perhaps
> even realistic. It looked like this:
> • Get a list of all the data files in a directory
> • Load each file into a dataframe
> • Merge them into a single data frame

Something along these lines:

all <- do.call( rbind, 
                 lapply( list.files(path=getwd(), pattern="\\.csv$",
                                    full.names=TRUE), 
                         read.csv) )


Untested, since no reproducible example was offered. (Note the anchored pattern "\\.csv$": an unescaped "." in the pattern would match any character, and full.names=TRUE returns paths that read.csv can open from any working directory.) This skips the task of individually assigning names to the input dataframes. There are quite a few variations on this in the Archives; you should learn to search them. Rseek.org or MarkMail are effective for me.
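To make that one-liner testable, here is a self-contained sketch that manufactures its own inputs (the directory and file names are invented for illustration): it writes three small CSVs with identical columns, then stacks them.

```r
## Made-up example data: three CSVs with the same columns.
dir <- file.path(tempdir(), "csvdemo")
dir.create(dir, showWarnings = FALSE)
for (i in 1:3)
  write.csv(data.frame(a = i, b = i * 10),
            file.path(dir, paste0("part", i, ".csv")),
            row.names = FALSE)

## The same idiom as above: list the files, read each, stack the results.
files <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
all   <- do.call(rbind, lapply(files, read.csv))
nrow(all)   # 3 -- one row per input file
```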



> Because all of the columns were the same, the simplest solution in my
> mind was to `Reduce' the vector of dataframes with a call to
> `merge'. That worked fine, I got what was expected. That is key
> actually. It is literally a one-liner, and there will never be index
> or scoping errors with it.

You might have forced `merge` to work with the correct choice of arguments, but it would have silently eliminated duplicate rows. It seems unlikely to me that it would be efficient for the purpose of just stacking dataframe values.

> merge( data.frame(a=1, b=2), data.frame(a=3, b=4) )
[1] a b
<0 rows> (or 0-length row.names)

> merge( data.frame(a=1, b=2), data.frame(a=3, b=4) , all=TRUE)
  a b
1 1 2
2 3 4
> merge( data.frame(a=1, b=2), data.frame(a=1, b=2) )
  a b
1 1 2

> rbind( data.frame(a=1, b=2), data.frame(a=1, b=2) )
  a b
1 1 2
2 1 2
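The consequence for the original Reduce-over-merge approach shows up as soon as the same row occurs in more than one file. A small sketch (the data frames here are made up):

```r
## Two inputs that share the row (1, 2): merging on all common columns
## collapses the duplicate, while rbind keeps both copies.
d1 <- data.frame(a = c(1, 2), b = c(2, 4))
d2 <- data.frame(a = c(1, 3), b = c(2, 6))

merged  <- Reduce(function(x, y) merge(x, y, all = TRUE), list(d1, d2))
stacked <- do.call(rbind, list(d1, d2))

nrow(merged)   # 3 -- the shared row appears once
nrow(stacked)  # 4 -- the shared row appears twice
```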

> Now with that in mind, what is the idiomatic way? Do people usually do
> something else because it is /faster/ (by some definition)?
> Kind regards,
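On the speed question: for pure stacking, the rbind-based idiom avoids merge's join machinery entirely, and Reduce-with-merge rebuilds a growing result at every step. A rough, illustrative timing sketch you could run yourself (the sizes are made up; actual numbers depend on your data):

```r
## Ten made-up dataframes with identical columns and distinct rows.
dfs <- replicate(10, data.frame(a = runif(200), b = runif(200)),
                 simplify = FALSE)

system.time(stacked <- do.call(rbind, dfs))
system.time(merged  <- Reduce(function(x, y) merge(x, y, all = TRUE), dfs))

## With all rows distinct, both produce the same number of rows.
nrow(stacked)
```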


David Winsemius
Alameda, CA, USA
