[R] the large dataset problem

Tue Jul 31 13:22:23 CEST 2007

Just a note of thanks for all the help I have received. I haven't gotten a chance to implement any of your suggestions because I'm still trying to catalog all of them! Thank you so much!

Just to recap (for my own benefit and to create a summary for others):

Bruce Bernzweig suggested using the R.huge package.

Ben Bolker pointed out that my original message wasn't clear and asked what I want to do with the data. At this point, just getting a dataset loaded would be wonderful, so I'm trying to trim variables (and if possible, I would also like to trim observations). He also provided an example of "vectorizing."
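
For anyone else new to the term, here is a minimal sketch of what "vectorizing" means in R. This is not Ben's example, just a toy illustration with made-up numbers: the same calculation written once as a loop and once as a single operation on the whole vector.

```r
x <- c(1.5, 2.0, 3.5, 4.0)   # made-up numbers for illustration

# Loop version: squares one element at a time.
sq_loop <- numeric(length(x))
for (i in seq_along(x)) sq_loop[i] <- x[i]^2

# Vectorized version: squares the whole vector in one expression.
sq_vec <- x^2

# Both approaches give the same values.
all.equal(sq_loop, sq_vec)
```

On large data the vectorized form is usually much faster, because the looping happens inside R's compiled code rather than in the interpreter.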

Ted Harding suggested that I use AWK to process the data and provided the necessary code. He also tested his code on older hardware running GNU/Linux (or Unix?) and showed that AWK can process the data even when the computer has very little memory and processing power. Jim Holtman had similar success when he used Cygwin's UNIX utilities on a machine running MS Windows. They both used the following code:

     gawk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," $(5678)}' \
          < tempxx.txt > newdata.csv

Fortunately, there is a version of GAWK for MS Windows. ... Not that I like MS Windows. It's just that I'm forced to use that 19th century operating system on the job. (After using Debian at home and happily running RKWard for my dissertation, returning to Windows World is downright depressing). 

Roland Rau suggested that I use a database with RSQLite and pointed out that RODBC can work with MS Access. He also pointed me to a sub-chapter in Venables and Ripley's _S Programming_ and "The Whole-Object View" pages in John Chambers's _Programming with Data_.
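
As a rough sketch of the database idea (not Roland's code), the following assumes the DBI and RSQLite packages are installed. The table name `survey` and its columns are hypothetical, and an in-memory database stands in for what would really be a file on disk. The point is that the SQL query does the trimming, so only the requested columns and rows are ever returned to R.

```r
# Sketch only: requires the DBI and RSQLite packages.
library(DBI)

# In-memory database for the demo; a real dataset would use a file path instead.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table standing in for the large dataset.
survey <- data.frame(id     = 1:4,
                     income = c(52000, 48000, 61000, 39000),
                     age    = c(34, 51, 29, 45))
dbWriteTable(con, "survey", survey)

# Select two of the three columns and only the rows with age < 50.
small <- dbGetQuery(con, "SELECT id, income FROM survey WHERE age < 50")

dbDisconnect(con)
```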

Greg Snow recommended biglm for regression analysis with data that is too large to fit into memory.
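
A minimal sketch of how I understand biglm is meant to be used, assuming the biglm package is installed. The data are simulated and the helper `make_chunk()` is made up for illustration. The idea is to fit the model on a first chunk and then update it with further chunks, so the full dataset never has to be in memory at once.

```r
# Sketch only: requires the biglm package.
library(biglm)

set.seed(1)

# Hypothetical helper: simulates one chunk, as if it had been read from disk.
make_chunk <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 2 + 3 * x + rnorm(n, sd = 0.1))
}
chunks <- list(make_chunk(500), make_chunk(500), make_chunk(500))

# Fit on the first chunk, then fold in the remaining chunks one at a time.
fit <- biglm(y ~ x, data = chunks[[1]])
for (ch in chunks[-1]) fit <- update(fit, ch)

coef(fit)   # estimates should be close to the simulated values 2 and 3
```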

Last, but not least, Peter Dalgaard pointed out that there are options within R. He suggests using the colClasses= argument when "reading" data (read.table) and the what= argument when "scanning" data (scan), so that you don't load more columns than necessary. He also provided the following script:

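     # Read the dictionary file as lines of text.
     dict <- readLines("ftp://www.sipp.census.gov/pub/sipp/2004/l04puw1d.txt")
     # Find the lines that start with "D ".
     D.lines <- grep("^D ", dict)
     # Parse just those lines into a data frame.
     vdict <- read.table(con <- textConnection(dict[D.lines])); close(con)
     head(vdict)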
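
To make the colClasses= and what= suggestions concrete, here is a minimal sketch under my own assumptions. The toy file and its column names (id, income, age, state) are made up. In colClasses=, the string "NULL" tells R to skip a column entirely. In scan(), a NULL entry in the what= list skips the corresponding field.

```r
# Hypothetical toy file standing in for the real (much wider) dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,income,age,state",
             "1,52000,34,NY",
             "2,48000,51,CA",
             "3,61000,29,TX"), tmp)

# colClasses: the string "NULL" drops a column while the file is being read.
keep <- read.csv(tmp, colClasses = c("integer", "numeric", "NULL", "NULL"))
str(keep)   # only id and income are loaded

# scan: fields matching NULL entries in what= are skipped.
cols <- scan(tmp, what = list(id = 0L, income = 0, NULL, NULL),
             sep = ",", skip = 1, quiet = TRUE)

unlink(tmp)   # remove the temporary file
```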

I'll try these solutions and report back on my success.

Thanks again!
- Eric