[R] the large dataset problem

Mon Jul 30 13:40:47 CEST 2007

Dear useRs,

I recently began a job at a very large and heavily bureaucratic organization. We're setting up a research office, and statistical analysis will form the backbone of our work. We'll be working with large datasets such as the SIPP as well as our own administrative data.

Due to the bureaucracy, it will take some time to get the licenses for proprietary software like Stata. Right now, R is the only statistical software package on my computer. 

This, of course, is a huge limitation because R loads data directly into RAM, making it difficult (if not impossible) to work with large datasets. My computer only has 1000 MB of RAM, of which Microsucks Winblows devours 400 MB. To make my memory issues even worse, my computer has a virus scanner that runs every day and I do not have the administrative rights to turn the damn thing off.

I need to find some way to overcome these constraints and work with large datasets. Does anyone have any suggestions?

I've read that I should "carefully vectorize my code." What does that mean?

The "Introduction to R" manual suggests modifying input files with Perl. Any tips on how to get started? Would the Perl Data Language (PDL) be a good choice? http://pdl.perl.org/index_en.html

I wrote a script which loads a large dataset a few lines at a time, writes the dozen or so variables of interest to a CSV file, removes the loaded data and then (via a "for" loop) loads the next few lines, and so on. I managed to get it to work with one of the SIPP core files, but it's SLOOOOW. Worse, if I discover later that I omitted a relevant variable, then I'll have to run the whole script all over again.
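
A minimal sketch of this kind of chunked extraction, for concreteness. It is not the actual script: the function name extract_columns(), its arguments, and the chunk size are hypothetical, and it assumes the input is a comma-separated text file with an unquoted header row (the raw SIPP files may well be laid out differently). Keeping the connection open between reads means each chunk continues where the last one stopped, instead of re-scanning the file from the top.

```r
# Hypothetical helper: copy selected columns of a large CSV to a new CSV,
# holding only `chunk_size` rows in memory at any one time.
extract_columns <- function(infile, outfile, keep, chunk_size = 1000) {
  con <- file(infile, open = "r")
  on.exit(close(con))

  # Read the header once; assumes plain, unquoted, comma-separated names.
  header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]

  first <- TRUE
  repeat {
    # The open connection remembers its position, so this reads the next chunk.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break

    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      check.names = FALSE, stringsAsFactors = FALSE)

    # Write the header with the first chunk only, then append.
    write.table(chunk[, keep, drop = FALSE], file = outfile, sep = ",",
                row.names = FALSE, col.names = first, append = !first)
    first <- FALSE
  }
  invisible(NULL)
}
```

A call might look like extract_columns("core.csv", "subset.csv", keep = c("id", "age")), where the file and variable names are placeholders. Adding a variable later still means another pass over the file, but larger chunks (as many rows as memory allows) reduce the per-chunk overhead.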

Any suggestions?

Thanks,
- Eric