[R] advice on big panel operations with mclapply?

ivo welch ivo.welch at anderson.ucla.edu
Wed Jul 3 07:52:10 CEST 2013


dear R experts:  I have a very large panel data set, about 2-8GB.  think

NU <- 30000;NT <- 3000
ds <- data.frame( unit= rep(1:NU, each=NT ), time=NA,  x=NA)
ds$time <- rep( 1:NT, NU )
ds$x <- rnorm(nrow(ds))
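
for scale, that toy version is already NU*NT = 90 million rows (two
integer columns plus one double column), so it should come in around
1.3 GB of RAM before any copies are made:

NU * NT                                ## 90,000,000 rows
print( object.size(ds), units="GB" )   ## roughly 1.3 GB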

I want to do a couple of operations within each unit first, and then
do some list operations at each time.  not difficult in principle.
think

  ds <- merge back in results of  mclapply( split(1:nrow(ds), ds$unit),
function( ids ) { work on ds[ids,] } )  # same unit
  ds <- merge back in results of  mclapply( split(1:nrow(ds), ds$time),
function( ids ) { work on ds[ids,] } )  # same time
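
spelled out, the unit pass could look roughly like the sketch below.
unit_stat() is just a placeholder for whatever per-unit work I actually
do (here, de-meaning x within each unit), mc.cores=4 is arbitrary, and
I return only the new column so the merge back stays cheap:

library(parallel)
unit_stat <- function(d) d$x - mean(d$x)             ## placeholder: de-mean x within a unit
idx <- split( seq_len(nrow(ds)), ds$unit )           ## row indices, one element per unit
res <- mclapply( idx, function(ids) unit_stat(ds[ids,]), mc.cores=4 )
ds$x_dm <- unsplit( res, ds$unit )                   ## back in the original row order

the time pass would be the same thing with ds$time in place of ds$unit.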

the problem is that ds is big.  I can hold 1 copy in RAM, but not 4.
what I really want is to declare ds "read-only shared memory" before
the mclapply() and have the spawned processes all work off the same
copy.  right now, each core seems to want its own private duplicate of
ds, which runs the machine out of memory.  I don't think genuinely
shared data is possible in R across mclapply().

* I could just run my code single-threaded.  this loses the
parallelism of the task, but the code remains parsimonious and the
memory footprint is still ok.

* I could just throw 120 GB of SSD at the problem as a swap file.  for
$100 or so, this ain't a bad solution.  it's slower than RAM but faster
and safer than coding up more complex R solutions, and it's still
likely faster than single-threaded operation on a quad-core machine.
if the swap algorithm is efficient, it shouldn't be so bad.

* I could pre-split the data into per-unit files before the mclapply()
and merge the results back afterwards; the mclapply() then works off
the small files rather than off ds itself.  the code would be uglier
and carry an extra layer of complexity ( = bugs ), but RAM consumption
drops by orders of magnitude.  I am thinking of something roughly like

## first operation: write one small file per unit, in parallel
mclapply( split(1:nrow(ds), ds$unit), function(di) {
    chunk <- ds[di,]                 ## save() needs a named object
    save( chunk, file=sprintf("@%05d.Rdata", chunk$unit[1]) )  ## zero-pad so file order == unit order
} )
rm(ds)  ## make space for the next mclapply
results <- mclapply( Sys.glob("@*.Rdata"), function( fnam ) { load(fnam)
    ...do whatever with chunk... } )  ## run many many happy small-mem processes
system("rm @*.Rdata")  ## remove the temporary files
load("ds.Rdata")  ## since we deleted ds, we have to reload the original data
## combine results with the full ds
ds <- data.frame( ds, do.call(rbind, results) )  ## assumes each chunk returns its new columns in row order
## now run the second operation, split by time, in the same way

* I could dump the data into a database, but then every access (like
the split() or the mclapply()) would have to query and reload the data
again, just like with my .Rdata files.  is that really faster/better
than abusing the file system and R's native file formats?  I doubt it,
but I don't know for sure.
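
for concreteness, the database route could look roughly like the
sketch below.  RSQLite is just one possible backend, "panel.sqlite"
and the mean(d$x) worker are made-up placeholders, and each worker
opens its own connection because connections don't survive the fork:

library(parallel); library(DBI)
con <- dbConnect( RSQLite::SQLite(), "panel.sqlite" )
dbWriteTable( con, "ds", ds, overwrite=TRUE )        ## one-time load
dbExecute( con, "CREATE INDEX idx_unit ON ds(unit)" )
units <- dbGetQuery( con, "SELECT DISTINCT unit FROM ds" )$unit
dbDisconnect(con); rm(ds)                            ## free the RAM copy
res <- mclapply( units, function(u) {
  con <- dbConnect( RSQLite::SQLite(), "panel.sqlite" )  ## per-worker connection
  on.exit( dbDisconnect(con) )
  d <- dbGetQuery( con, sprintf("SELECT * FROM ds WHERE unit = %d", as.integer(u)) )
  mean(d$x)                                          ## placeholder per-unit work
}, mc.cores=4 )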

this is a reasonably common problem with large data sets.  I saw some
specific solutions on stackoverflow, a couple of which require even
less parsimonious user code.  is everyone using bigmemory?  or SQL?
or ... ?  I am leaning towards the SSD solution.  am I overlooking
some simpler recommended solution?
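
and in case it helps the discussion, here is roughly what I imagine a
bigmemory version would look like: an untested sketch, only workable
because my panel is all numeric, and the one-time conversion still
needs ds plus one matrix copy in RAM.  the per-unit work is again just
a mean() stand-in:

library(parallel); library(bigmemory)
bm <- as.big.matrix( as.matrix(ds), backingfile="ds.bin", descriptorfile="ds.desc" )
desc <- describe(bm)
rm(ds)                                               ## workers attach instead of copying
idx <- split( seq_len(nrow(bm)), bm[,1] )            ## column 1 = unit
res <- mclapply( idx, function(ids) {
  m <- attach.big.matrix(desc)                       ## shared, file-backed, no copy
  mean( m[ids,3] )                                   ## column 3 = x; placeholder work
}, mc.cores=4 )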

/iaw

----
Ivo Welch (ivo.welch at gmail.com)


