[R] Using huge datasets

Liaw, Andy andy_liaw at merck.com
Wed Feb 4 17:11:59 CET 2004

A matrix of that size takes up just over 320MB to store in memory.  I'd
imagine you can probably do it with 2GB of physical RAM (assuming your
`columns' are all numeric variables; i.e., no factors).
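The 320MB figure is just a back-of-the-envelope calculation: 4.2 million rows times 10 columns of doubles at 8 bytes each.

```r
## Rough memory estimate for a 4.2e6 x 10 numeric matrix,
## at 8 bytes per double, expressed in MB (2^20 bytes).
4.2e6 * 10 * 8 / 2^20   # about 320.4 MB
```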

However, perhaps a better approach than the brute-force, one-shot read is to
read the data in chunks and do the prediction piece by piece.  You can use
scan(), or open()/readLines()/close(), to do this fairly easily.
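A minimal sketch of the chunked approach, along these lines: the toy glm fit and the small csv written below are stand-ins for the real model and the 4.2-million-row file, and the chunk size of 2 is only for illustration (something like 1e5 would be sensible in practice).

```r
## Stand-ins for the real model and data file (hypothetical).
set.seed(1)
train <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
fit <- glm(y ~ x, data = train)
write.csv(data.frame(x = 1:5, y = 0), "big.csv",
          row.names = FALSE, quote = FALSE)

## Read the file a chunk at a time and predict on each chunk,
## so the full dataset is never held in memory at once.
con <- file("big.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]
preds <- list()
repeat {
  lines <- readLines(con, n = 2)   # tiny chunks here; use ~1e5 for real data
  if (length(lines) == 0) break
  chunk <- read.csv(textConnection(lines), header = FALSE,
                    col.names = header)
  preds[[length(preds) + 1]] <- predict(fit, newdata = chunk)
}
close(con)
predictions <- unlist(preds)   # one prediction per data row
```

Only one chunk plus the accumulated prediction vector is resident at any time, which is what keeps the footprint small.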

My understanding of how (most) clusters work is that you need at least one
node that will accommodate the memory load for the monolithic R process, so
a cluster is probably not much help.  (I could very well be wrong about
this.  If so, I'd be very grateful for correction.)


> From: Fabien Fivaz
> Hi,
> Here is what I want to do. I have a dataset containing 4.2 *million*
> rows and about 10 columns and want to do some statistics with it,
> mainly using it as a prediction set for GAM and GLM models. I tried
> to load it from a csv file but, after filling up memory and part of
> the swap (1 GB each), I get a segmentation fault and R stops. I use
> R under Linux. Here are my questions:
> 1) Has anyone ever tried to use such a big dataset?
> 2) Do you think that it would be possible on a more powerful machine,
> such as a cluster of computers?
> 3) Finally, does R have some "memory limitation" or does it just
> depend on the machine I'm using?
> Best wishes
> Fabien Fivaz
