[R] the large dataset problem

Roland Rau roland.rproject at gmail.com
Mon Jul 30 21:34:30 CEST 2007


Eric Doviak wrote:
> 
> I need to find some way to overcome these constraints and work with large datasets. Does anyone have any suggestions?
I might not be the most authoritative person on this subject, but I put 
all my large datasets[1] into an SQLite database and extract/summarize 
data from it with R using the RSQLite package. If your data come in 
ASCII format, it is rather easy to read them into an SQLite DB.
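
A minimal sketch of that workflow, assuming the DBI and RSQLite packages are installed. The file, the table name "survey" and the columns are made up for illustration, and the toy file is small enough to pass through read.csv(); a really large file would have to be imported in pieces or with SQLite's own import tools.

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver

# Hypothetical ASCII file, created here so that the sketch is self-contained.
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id     = 1:6,
                     state  = c("NJ", "NY", "NJ", "PA", "NY", "NJ"),
                     income = c(10, 20, 30, 40, 50, 60)),
          csv_file, row.names = FALSE)

# The database is a single file on disk.
db_file <- tempfile(fileext = ".sqlite")
con <- dbConnect(SQLite(), dbname = db_file)

# One-off import of the ASCII data into a table.
dbWriteTable(con, "survey", read.csv(csv_file, stringsAsFactors = FALSE))

# Extract: only the rows and columns of interest come back into R.
nj <- dbGetQuery(con, "SELECT id, income FROM survey WHERE state = 'NJ'")

# Summarize: let the database do the aggregation.
avg <- dbGetQuery(con, "SELECT state, AVG(income) AS mean_income
                        FROM survey GROUP BY state ORDER BY state")

dbDisconnect(con)
```

If a variable turns out to be missing later, it only takes another SELECT rather than a second pass over the raw file.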

> 
> I've read that I should "carefully vectorize my code." What does that mean ??? !!!
The book "S Programming" by Venables & Ripley has a sub-chapter on this.
If you happen to have John Chambers' "Programming with Data" book, there 
are a few pages on "The Whole-Object View".
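
As a rough illustration of what "vectorizing" usually means (a sketch with made-up numbers, not taken from either book): instead of looping over the elements one by one, apply a single expression to the whole vector.

```r
x <- c(1200, 0, 450, 3100, 800)   # made-up values

# Element-by-element version: an explicit loop over the indices.
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0) {
    loop_result[i] <- log(x[i])
  } else {
    loop_result[i] <- NA
  }
}

# Vectorized version: one expression applied to the whole object.
vec_result <- ifelse(x > 0, log(x), NA)

# Both give the same answer.
all.equal(loop_result, vec_result)
```

On long vectors the second form is typically much faster, because the looping happens in compiled code rather than in the R interpreter.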

> 
> I wrote a script which loads large datasets a few lines at a time, writes the dozen or so variables of interest to a CSV file, removes the loaded data and then (via a "for" loop) loads the next few lines .... I managed to get it to work with one of the SIPP core files, but it's SLOOOOW. Worse, if I discover later that I omitted a relevant variable, then I'll have to run the whole script all over again.
> 
So you have huge datasets but you never need the whole dataset? 
Just a selected number of variables, and then the files are of manageable 
size?
If this is the case, using RSQLite is a good option. Any other DB 
package would do as well; RODBC, for example, is very easy to use if 
you have an MS Access DB. Alternatively, are you familiar with some 
old-fashioned Unix tools? Ports for MS Windows also exist, and the 
program 'cut' could help you considerably.
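
For comparison, here is an R-only analogue of column selection with 'cut'. Note that this swaps in a different technique: it uses the colClasses argument of read.csv(), where "NULL" means "skip this column", rather than calling 'cut' itself. The file and the column names are hypothetical.

```r
# Hypothetical wide file standing in for a large ASCII dataset.
big_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id     = 1:4,
                     junk1  = letters[1:4],
                     income = c(10, 20, 30, 40),
                     junk2  = runif(4)),
          big_file, row.names = FALSE)

# Read the header plus one row, just to learn the column names.
header <- names(read.csv(big_file, nrows = 1))

# "NULL" drops a column while reading; NA lets R guess the type.
keep    <- c("id", "income")
classes <- ifelse(header %in% keep, NA, "NULL")

# Only the columns of interest are kept in memory.
small <- read.csv(big_file, colClasses = classes)
names(small)
```

R still has to scan every line of the file, so for very large files a pre-filter such as 'cut' or a database is likely to remain the better choice.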


Please note:
- I am only a casual user of the DB interfaces, so there might be better 
solutions, and people with more detailed knowledge about them.
- All the tools I mentioned here are licensed under the same or similar 
free software licenses as R, so you should have no problems 
obtaining/installing them.
- A good source of information is the R Data Import/Export Manual -- 
shipped with every R distribution and available online at 
http://cran.at.r-project.org/doc/manuals/R-data.html

I hope this helps,
Roland


[1] The largest one is 1GB -- so probably not really large.


