[R] Large Dataset

Thomas Lumley tlumley at u.washington.edu
Wed Jan 7 10:52:44 CET 2009


There are several approaches to analyzing data sets much larger than memory, and the best approach does depend on the problem. It's certainly possible to process gigabytes of data on a 32-bit R system - the examples I've worked on are whole-genome association studies with 10^5-10^6 variables and 10^3-10^4 observations. Other people have worked with much larger data sets.

Some approaches, each with a rough sketch after the list, are:

- incremental reading using a connection to the file, reading a few thousand lines at a time.  The statistics Edwin wants can all be computed in a single pass through the data. This is what the biglm package does for linear models.

- storing the data in a relational database and then either
    *) using SQL commands (the mean, min, max are all built in to SQL) to do
        most of the work and just reading results (or interim results) into R
    *) reading appropriate chunks of the data into R and doing the computations
        there

- storing the data in netCDF or HDF5 formats and loading chunks into R.  These are less flexible than relational databases but more efficient for certain sorts of subsets.

- memory-mapping the data file (the ff package does this) to read sections of data. I haven't tried this, so I'm not sure where its advantages and disadvantages are.
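
To make these concrete: a minimal sketch of the incremental, one-pass
approach (untested; assumes a headerless, whitespace-separated file
"big.dat" whose first column is the numeric variable of interest):

con <- file("big.dat", open = "r")
n <- 0; s <- 0; lo <- Inf; hi <- -Inf
repeat {
    # read.table() on an open connection resumes where the last call
    # stopped; it throws an error when no lines remain, which we treat
    # as end-of-file
    chunk <- tryCatch(read.table(con, nrows = 10000),
                      error = function(e) NULL)
    if (is.null(chunk)) break
    x  <- chunk[[1]]
    n  <- n + length(x)
    s  <- s + sum(x)
    lo <- min(lo, x)
    hi <- max(hi, x)
}
close(con)
c(n = n, mean = s/n, min = lo, max = hi)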

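For the database route, a sketch using SQLite through the DBI and
RSQLite packages (untested; the table name "mytable" and the column
"value" are placeholders):

library(RSQLite)
db <- dbConnect(SQLite(), dbname = "big.db")
# let the database compute the summaries and return only one row
dbGetQuery(db, "SELECT COUNT(value) AS n, AVG(value) AS mean,
                       MIN(value) AS min, MAX(value) AS max
                FROM mytable")
# or pull a manageable chunk into R and work on it there
chunk <- dbGetQuery(db, "SELECT * FROM mytable LIMIT 10000")
dbDisconnect(db)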

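For netCDF, a sketch with the ncdf4 package (untested; the variable
name "genotype" and its layout are made up):

library(ncdf4)
nc <- nc_open("big.nc")
# read only the first 10000 rows of a 2-d variable; count = -1 means
# "all of this dimension", so the rest of the file is never touched
block <- ncvar_get(nc, "genotype", start = c(1, 1),
                   count = c(10000, -1))
nc_close(nc)
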
The bigmemory package does not address quite the same problem.  It deals with objects that fit in memory but are large enough that copying them is a bad idea, and it also deals with sharing between processes.
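
Still, since a simple bigmemory example was asked for, a sketch
(untested; note that a big.matrix holds a single numeric type
throughout, so it won't handle a file that mixes integers and strings):

library(bigmemory)
# file-backed, so the matrix lives on disk rather than in RAM
X <- read.big.matrix("big.csv", header = TRUE, type = "double",
                     backingfile = "big.bin",
                     descriptorfile = "big.desc")
mean(X[, 1]); min(X[, 1]); max(X[, 1])  # one column at a time fits in RAM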

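And finally, with the same caveat that I haven't tried it, a rough
sketch of the ff route mentioned above ("big.dat" and the column name
"value" are again placeholders):

library(ff)
# read.table.ffdf() reads the file in chunks into an on-disk ffdf object
df <- read.table.ffdf(file = "big.dat", header = TRUE)
s <- 0; n <- 0
for (i in chunk(df)) {      # chunk() yields row ranges sized to fit in RAM
    x <- df$value[i]        # only this slice is actually loaded
    s <- s + sum(x)
    n <- n + length(x)
}
s/n
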
       -thomas





On Tue, 6 Jan 2009, Simon Pickett wrote:

> Hi,
>
> I am not very knowledgeable about this kind of stuff, but my guess is that
> if you have a fairly slow computer and massive data sets there isn't a lot
> you can do except get a better computer, buy more RAM, or use something like
> SAS instead?
>
> Hopefully someone else will chip in Edwin, best of luck.
>
> Simon.
>
>
> ----- Original Message ----- From: "Edwin Sendjaja" <edwin7 at web.de>
> To: "Simon Pickett" <simon.pickett at bto.org>
> Cc: <r-help at r-project.org>
> Sent: Tuesday, January 06, 2009 2:53 PM
> Subject: Re: [R] Large Dataset
>
>
>> Hi Simon,
>> 
>> My RAM is only 3.2 GB (actually it should be 4 GB, but my motherboard
>> doesn't support it).
>> 
>> R uses almost all of my RAM and half of my swap. I think memory.limit will
>> not solve my problem.  It seems that I need more RAM.
>> 
>> Unfortunately, I can't buy more RAM.
>> 
>> Why is R slow at reading a big data set?
>> 
>> 
>> Edwin
>> 
>>> Only a couple of weeks ago I had to deal with this.
>>> 
>>> Adjust the memory limit as follows, although you might not want 4000; that
>>> is quite high...
>>> 
>>> memory.limit(size = 4000)
>>> 
>>> Simon.
>>> 
>>> ----- Original Message -----
>>> From: "Edwin Sendjaja" <edwin7 at web.de>
>>> To: "Simon Pickett" <simon.pickett at bto.org>
>>> Cc: <r-help at r-project.org>
>>> Sent: Tuesday, January 06, 2009 12:24 PM
>>> Subject: Re: [R] Large Dataset
>>> 
>>> > Hi Simon,
>>> >
>>> > Thanks for your reply.
>>> > I have read ?Memory, but I don't understand how to use it. I am not sure
>>> > if that can solve my problem. Can you give me more details?
>>> >
>>> > Thanks,
>>> >
>>> > Edwin
>>> >
>>> >> type
>>> >>
>>> >> ?memory
>>> >>
>>> >> into R and that will explain what to do...
>>> >>
>>> >> S
>>> >> ----- Original Message -----
>>> >> From: "Edwin Sendjaja" <edwin7 at web.de>
>>> >> To: <r-help at r-project.org>
>>> >> Sent: Tuesday, January 06, 2009 11:41 AM
>>> >> Subject: [R] Large Dataset
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > I have a 3.1 GB dataset (with 11 columns and lots of data as int and
>>> >> > string).
>>> >> > If I use read.table, it takes very long. It seems that my RAM is not
>>> >> > big enough (it gets overloaded). I have 3.2 GB RAM and 7 GB swap on
>>> >> > 64-bit Ubuntu.
>>> >> >
>>> >> > Is there a good solution for reading a large data set into R? I have
>>> >> > seen that people suggest the bigmemory or ff packages, but they seem
>>> >> > very complicated and I don't know how to start with them.
>>> >> >
>>> >> > I have tried to use bigmemory, but I got some kind of errors, so I
>>> >> > gave up.
>>> >> >
>>> >> >
>>> >> > Can someone give me a simple example of how to use ff or bigmemory,
>>> >> > or maybe a better solution?
>>> >> >
>>> >> >
>>> >> >
>>> >> > Thank you in advance,
>>> >> >
>>> >> >
>>> >> > Edwin
>>> >> >
>> 
>> 
>> 
>
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



