[R] merging and working with big data sets

Jay Emerson jayemerson at gmail.com
Tue Oct 12 13:52:18 CEST 2010


I can't speak for ff and filehash, but bigmemory's data structure
doesn't allow "clever" merges (for good reasons).  However, a manual
merge is still probably less painful (and faster) than the other
options.  We don't implement it ourselves: we leave it to the user,
because the details vary from example to example and the code is
trivial.

- Allocate an empty new filebacked big.matrix of the proper size.
- Fill it in chunks: typically a column at a time if you can afford
the RAM overhead, or a portion of a column at a time otherwise.
Column operations are more efficient than row operations (again,
because of the internals of the data structure); see the sketch
after this list.
- Because you'll be using file-backed matrices, RAM limitations
won't matter beyond the overhead of copying each chunk.
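
A minimal sketch of that pattern, assuming two existing big.matrix
objects A and B with matching columns that are to be stacked
row-wise (an rbind-style merge); the names, dimensions, and file
names here are hypothetical:

    library(bigmemory)

    ## Hypothetical inputs: A and B are big.matrix objects with the
    ## same columns; stack them row-wise into a new filebacking.
    merged <- filebacked.big.matrix(nrow = nrow(A) + nrow(B),
                                    ncol = ncol(A),
                                    type = "double",
                                    backingfile = "merged.bin",
                                    descriptorfile = "merged.desc")

    ## Fill one column at a time; column access is cheap because
    ## the data are stored in column-major order.
    for (j in seq_len(ncol(merged))) {
      merged[1:nrow(A), j] <- A[, j]
      merged[(nrow(A) + 1):nrow(merged), j] <- B[, j]
    }

If a full column doesn't fit in RAM, the same loop can walk row
chunks within each column instead.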

I should note: if you used separated=TRUE, each column would have a
separate binary file, and a "smart" cbind() would be possible simply
by manipulating the descriptor file.  Again, this is not something
we advise or formally provide, but it wouldn't be hard; a sketch
follows.
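
For illustration only, a sketch of the separated layout; the
descriptor surgery itself is left out, since that is exactly the
unsupported part, but the descriptor is plain text and easy to
inspect:

    library(bigmemory)

    ## separated = TRUE stores each column in its own binary file.
    z <- filebacked.big.matrix(nrow = 1000, ncol = 3,
                               type = "double",
                               separated = TRUE,
                               backingfile = "cols.bin",
                               descriptorfile = "cols.desc")

    ## The descriptor file is dput()-style text: readable with
    ## dget() (and editable by hand, at your own risk).
    desc <- dget("cols.desc")

    ## attach.big.matrix() reattaches from a descriptor file.
    z2 <- attach.big.matrix("cols.desc")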

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay


