[Rd] Rapid Random Access

Fri Dec 14 19:01:30 CET 2007

  I have some code that can potentially produce a huge number of 
large-ish R data frames, each with a different number of rows. All the 
data frames together will be way too big to keep in R's memory, but 
we'll assume a single one is manageable. It's just when there's a 
million of them that the machine might start to burn up.

  However, I might, for example, want to compute some averages over the 
elements in the data frames. Or I might want to sample ten of them at 
random and do some plots. What I need is rapid random access to data 
stored in external files.

  Here are some ideas I've had:

  * Store all the data in an HDF5 file. The problem here is that the 
current HDF package for R reads the whole file in at once.

  * Store the data in some other custom binary format with an index for 
rapid access to the N-th elements. Problems: feels like reinventing HDF, 
cross-platform issues, etc.

  * Store the data in a number of .RData files in a directory. Then to 
get the N-th element just attach(paste("foo/A-", n, ".RData", sep="")), 
give or take a parameter or two.

  * Use a database. Seems a bit heavyweight, but maybe using RSQLite 
could work in order to keep it local.
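
  As a rough sketch of the directory-of-.RData-files idea from the list 
above: the helper names (frame_path, save_frame, load_frame) and the 
fixed object name "df" are assumptions made for illustration, not part 
of any package. It uses load() into a private environment rather than 
attach(), so nothing is left on the search path after each access.

```r
# Hypothetical helpers; the "A-<n>.RData" naming follows the example above.
frame_path <- function(dir, n) {
  file.path(dir, paste("A-", n, ".RData", sep = ""))
}

# Save one data frame under the fixed object name "df" (an assumption),
# so the loader knows which name to look for.
save_frame <- function(df, dir, n) {
  save(df, file = frame_path(dir, n))
  invisible(frame_path(dir, n))
}

# Load the n-th data frame into a private environment and return it.
load_frame <- function(dir, n) {
  env <- new.env()
  load(frame_path(dir, n), envir = env)
  get("df", envir = env)
}

# Usage: write three frames with different row counts, then read one back.
dir <- file.path(tempdir(), "foo")
dir.create(dir, showWarnings = FALSE)
for (n in 1:3) {
  save_frame(data.frame(x = seq_len(n * 10), y = rnorm(n * 10)), dir, n)
}
second <- load_frame(dir, 2)
```

  Only the requested file is read, so the cost of getting the N-th 
element depends on the size of that one data frame, not on how many 
files are in the directory.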

  What I'm currently doing is keeping it OO enough that I can in theory 
implement all of the above. At the moment I have an implementation that 
does keep them all in R's memory as a list of data frames, which is fine 
for small test cases but things are going to get big shortly. Any other 
ideas or hints are welcome.
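
  One way the "OO enough" layer might look, sketched with S3 generics: 
all names here (n_frames, get_frame, memory_store, file_store, 
overall_mean) are hypothetical, and the on-disk backend uses one 
saveRDS() file per data frame as a stand-in for whichever storage 
option ends up being chosen. Code written against the two generics 
does not need to know which backend it is talking to.

```r
# Generic accessors; every backend supplies methods for these two.
n_frames  <- function(store) UseMethod("n_frames")
get_frame <- function(store, n) UseMethod("get_frame")

# Backend 1: everything held in memory as a list of data frames.
memory_store <- function(frames) {
  structure(list(frames = frames), class = "memory_store")
}
n_frames.memory_store  <- function(store) length(store$frames)
get_frame.memory_store <- function(store, n) store$frames[[n]]

# Backend 2 (assumed layout): one .rds file per data frame in a directory.
file_store <- function(frames, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (n in seq_along(frames)) {
    saveRDS(frames[[n]], file.path(dir, sprintf("A-%d.rds", n)))
  }
  structure(list(dir = dir, n = length(frames)), class = "file_store")
}
n_frames.file_store  <- function(store) store$n
get_frame.file_store <- function(store, n) {
  readRDS(file.path(store$dir, sprintf("A-%d.rds", n)))
}

# Backend-agnostic: mean of one column over all rows of all frames,
# holding only one frame in memory at a time.
overall_mean <- function(store, col) {
  total <- 0
  count <- 0
  for (n in seq_len(n_frames(store))) {
    v <- get_frame(store, n)[[col]]
    total <- total + sum(v)
    count <- count + length(v)
  }
  total / count
}

# Usage: the same calls work against either backend.
frames <- lapply(1:5, function(n) data.frame(x = seq_len(n)))
mem    <- memory_store(frames)
disk   <- file_store(frames, file.path(tempdir(), "frames"))
mean_mem  <- overall_mean(mem, "x")
mean_disk <- overall_mean(disk, "x")

# Random sample of two frames, fetched individually from disk.
picked  <- sample(n_frames(disk), 2)
sampled <- lapply(picked, function(n) get_frame(disk, n))
```

  Swapping in an HDF5, RSQLite or custom binary backend would then mean 
writing two more methods, with the averaging and sampling code left 
unchanged.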

thanks

Barry