[R] Large data sets with R (binding to hadoop available?)

Thomas Lumley tlumley at u.washington.edu
Fri Aug 22 18:10:55 CEST 2008


On Thu, 21 Aug 2008, Roland Rau wrote:
> Hi
>
> Avram Aelony wrote: (in part)
>> 
>> 1. How do others handle large data sets (gigabytes, terabytes) for 
>> analysis in R?
>> 
> I usually try to store the data in an SQLite database and interface via 
> functions from the packages RSQLite (and DBI).
>
> No idea about Question No. 2, though.
>
> Hope this helps,
> Roland
>
>
> P.S. When I am sure that I only need a certain subset of a large data set, 
> I still prefer to do some pre-processing in awk (gawk).
> P.P.S. The sizes of my data sets are in the gigabyte range (not the 
> terabyte range). This might be important if your data sets are *really 
> large* and you want to use SQLite: http://www.sqlite.org/whentouse.html
>
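
The RSQLite/DBI approach quoted above can look roughly like the
following. This is a minimal sketch: the file name big_data.csv, the
table name, and the column "value" are invented for illustration.

library(RSQLite)   # loads DBI as well

## open (or create) an on-disk SQLite database file
con <- dbConnect(SQLite(), dbname = "big_data.sqlite")

## import the raw file once; for files that do not fit in memory this
## step can be done in chunks (read.csv with nrows/skip, then
## dbWriteTable with append = TRUE)
dbWriteTable(con, "measurements", read.csv("big_data.csv"))

## later, pull only the subset needed for a given analysis into R
sub <- dbGetQuery(con,
                  "SELECT * FROM measurements WHERE value > 1000")

dbDisconnect(con)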

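The awk (gawk) pre-processing mentioned in the postscript above can
also be driven from R through a pipe connection, so the full file is
never read into memory. The file name, field number, and filter
condition are invented for illustration.

## keep the header line plus rows whose 4th comma-separated field
## exceeds 1000, and read only that subset into R
sub <- read.csv(pipe("gawk -F, 'NR == 1 || $4 > 1000' big_data.csv"))
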
I use netCDF for (genomic) datasets in the 100GB range, with the ncdf 
package, because SQLite was too slow for the sort of queries I needed. 
HDF5 would be another possibility; I'm not sure of the current status of 
the HDF5 support in Bioconductor, though.
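
For reference, reading a rectangular slice out of a netCDF file with
the ncdf package looks roughly like the sketch below; the file name,
variable name, and index ranges are invented for illustration.

library(ncdf)

## open an existing netCDF file (hypothetical name)
nc <- open.ncdf("genotypes.nc")

## read rows 1-1000 of the first 10 columns of a variable "dosage"
## rather than loading the whole array into memory
x <- get.var.ncdf(nc, "dosage", start = c(1, 1), count = c(1000, 10))

close.ncdf(nc)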

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle


