[R] gc() and memory efficiency

Tue Feb 5 03:36:42 CET 2008

On 4 February 2008 at 20:45, Doran, Harold wrote:
| I have a program which reads in a very large data set, performs some analyses, and then repeats this process with another data set. As soon as the first set of analyses are complete, I remove the very large object and clean up to try and make memory available in order to run the second set of analyses. The process looks something like this:
| 
| 1) read in data set 1 and perform analyses
| rm(list=ls())
| gc()
| 2) read in data set 2 and perform analyses
| rm(list=ls())
| gc()
| ...
| 
| But, it appears that I am not making the memory that was consumed in step 1 available back to the OS as R complains that it cannot allocate a vector of size X as the process tries to repeat in step 2. 
| 
| So, I close and reopen R and then drop in the code to run the second analysis. When this is done, I close and reopen R and run the third analysis. 
| 
| This is terribly inefficient. Instead I would rather just source in the R code and let the analyses run over night.
| 
| Is there a way that I can use gc() or some other function more efficiently rather than having to close and reopen R at each iteration?

I haven't found one. 

Every (trading) day I process batches of data with R, and the only reliable
way I have found is to use fresh R sessions.  Otherwise, the fragmented memory
will eventually result in the all-too-familiar 'cannot allocate X mb' for
rather small values of X relative to my total RAM. C'est la vie.

As gc() seems to help somewhat yet not 'sufficiently', fresh starts are the
alternative, and Rscript starts faster than the main R. Now, I happen to
be partial to littler [1], which starts even faster, so I use that (on Linux;
I am not sure if it can be built on Windows, as we embed R directly and
hence start faster than Rscript).  But either one should help you with some
batch files, giving you a way to run overnight.  And once you start batching
things, it is only a small step to regain efficiency by parallel execution
using something like MPI or NWS.
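
As a rough sketch of the batching idea (an illustration, not my actual
setup): a small driver that launches each analysis in its own Rscript
process, so whatever memory a step consumed goes back to the OS when that
process exits. The function names run_fresh and run_all and the script names
are made up for the example, and system2() assumes a reasonably recent R
(system() can be used in much the same way).

```r
# Run one analysis script in a brand-new R process.
# 'rscript' defaults to the Rscript binary shipped with the running R.
run_fresh <- function(script,
                      rscript = file.path(R.home("bin"), "Rscript")) {
  # system2() waits for the child process and returns its exit status;
  # the child's memory is released when it terminates.
  status <- as.integer(system2(rscript, args = c("--vanilla", shQuote(script))))
  if (status != 0L) {
    warning("script failed: ", script, " (exit status ", status, ")")
  }
  status
}

# Run several scripts one after another, each in its own session,
# and collect the exit statuses.
run_all <- function(scripts) {
  vapply(scripts, run_fresh, integer(1))
}

# Hypothetical usage, with one script per data set:
# run_all(c("analysis1.R", "analysis2.R", "analysis3.R"))
```

Each script has to read its own data and write its own results to disk, as
nothing is shared between the sessions.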

Hth, Dirk

[1] littler, by Jeff and myself, is the predecessor to Rscript. See either 
	http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/LittleR
    or 
	http://dirk.eddelbuettel.com/code/littler.html
    for more on littler and feel free to email us.

-- 
Three out of two people have difficulties with fractions.