[R] Using huge datasets
Roger D. Peng
rpeng at jhsph.edu
Wed Feb 4 17:18:45 CET 2004
By my calculation, your dataset should occupy less than
400MB of RAM, so this is not a terribly large dataset (these
days). But that does not include any attributes (such as
row names), which often take up a lot of memory as well.
Considering that a function like read.csv() makes a copy of
the dataset, your actual requirements are ~800MB, which for a
1GB machine may be too big depending on what else the
computer is doing. I have successfully loaded *much* bigger
datasets into R (2-4GB) without a problem.
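For reference, a rough version of that back-of-the-envelope
calculation in R, assuming all 10 columns are stored as doubles
(8 bytes per value):

  rows  <- 4.2e6                 # rows in the posted dataset
  cols  <- 10                    # columns
  bytes <- rows * cols * 8       # 8 bytes per double value
  bytes / 1024^2                 # ~320 MB for the raw data
  2 * bytes / 1024^2             # ~640 MB allowing for read.csv()'s extra copy

Attributes such as row names come on top of that, which is why
the actual figure is closer to ~800MB.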
Some possible solutions are:
1. Buy more RAM.
2. Use scan(), which doesn't make a copy of the dataset (see
   the sketch after this list).
3. Use a 64-bit machine and buy even more RAM.
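A minimal sketch of option 2, assuming the data are entirely
numeric and sit in a file called bigdata.csv with one header
row (the file name and column count are placeholders, not from
the original post):

  # scan() reads all values into a single numeric vector ...
  x <- scan("bigdata.csv", what = double(0), sep = ",", skip = 1)
  # ... which can then be reshaped; CSV values arrive row by row
  dat <- matrix(x, ncol = 10, byrow = TRUE)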
Using a cluster of computers doesn't really help in this
situation because there's no easy way to spread a dataset
across multiple machines. So you will still be limited by
the memory on a single machine.
As far as I know, R does not have a "memory limitation" --
the only limit is the memory installed on your computer.
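If you want to see where the memory actually goes, object.size()
and gc() give a quick picture; the toy matrix here is only for
illustration:

  x <- matrix(rnorm(1e6), ncol = 10)   # one million doubles, roughly 8MB
  object.size(x)                       # bytes used by x itself
  gc()                                 # summary of R's current allocations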
Fabien Fivaz wrote:
> Here is what I want to do. I have a dataset containing 4.2 *million*
> rows and about 10 columns and want to do some statistics with it, mainly
> using it as a prediction set for GAM and GLM models. I tried to load it
> from a CSV file but, after filling up memory and part of the swap (1 GB
> each), I get a segmentation fault and R stops. I use R under Linux. Here
> are my questions :
> 1) Has anyone ever tried to use such a big dataset?
> 2) Do you think that it would be possible on a more powerful machine, such
> as a cluster of computers?
> 3) Finally, does R have some "memory limitation" or does it just depend on
> the machine I'm using?
> Best wishes
> Fabien Fivaz