[R] the large dataset problem

Greg Snow Greg.Snow at intermountainmail.org
Tue Jul 31 00:43:35 CEST 2007

Check out the biglm package for some tools that may be useful.
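
A minimal sketch of the idea, assuming the biglm package is installed: fit a linear model on a first chunk of data, then fold in further chunks with update(), so that only one chunk has to sit in memory at a time. The built-in mtcars data, the model formula, and the 10-row chunk size are stand-ins for illustration, not anything from the original question.

```r
library(biglm)  # assumes install.packages("biglm") has already been run

# Stand-in for a large dataset: split mtcars into chunks of 10 rows.
chunk_id <- ceiling(seq_len(nrow(mtcars)) / 10)
chunks <- split(mtcars, chunk_id)

# Fit on the first chunk, then fold in the remaining chunks one at a time.
fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

coef(fit)  # should agree with coef(lm(mpg ~ wt + hp, data = mtcars))
```

In real use each chunk would come from reading the next block of rows from disk rather than from split().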

-----Original Message-----
From: "Eric Doviak" <edoviak at earthlink.net>
To: "r-help at stat.math.ethz.ch" <r-help at stat.math.ethz.ch>
Sent: 7/30/07 9:54 AM
Subject: [R] the large dataset problem

Dear useRs,

I recently began a job at a very large and heavily bureaucratic organization. We're setting up a research office and statistical analysis will form the backbone of our work. We'll be working with large datasets such as the SIPP as well as our own administrative data.

Due to the bureaucracy, it will take some time to get the licenses for proprietary software like Stata. Right now, R is the only statistical software package on my computer. 

This, of course, is a huge limitation because R loads data directly into RAM, making it difficult (if not impossible) to work with large datasets. My computer only has 1000 MB of RAM, of which Microsucks Winblows devours 400 MB. To make my memory issues even worse, my computer has a virus scanner that runs every day and I do not have the administrative rights to turn the damn thing off.

I need to find some way to overcome these constraints and work with large datasets. Does anyone have any suggestions?

I've read that I should "carefully vectorize my code." What does that mean ??? !!!
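
For what it's worth, "vectorizing" usually means replacing an explicit element-by-element loop with a single operation on the whole vector, which R carries out in compiled code. A small sketch with made-up numbers:

```r
x <- c(1.5, 2.0, 3.25, 4.0)   # made-up data for illustration

# Loop version: handles one element at a time.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the operation applies to the whole vector at once.
squares_vec <- x^2

identical(squares_loop, squares_vec)
```

Both versions give the same answer; on long vectors the vectorized form is typically much faster and avoids building results piece by piece.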

The "Introduction to R" manual suggests modifying input files with Perl. Any tips on how to get started? Would Perl Data Language (PDL) be a good choice?  http://pdl.perl.org/index_en.html

I wrote a script which loads large datasets a few lines at a time, writes the dozen or so variables of interest to a CSV file, removes the loaded data and then (via a "for" loop) loads the next few lines .... I managed to get it to work with one of the SIPP core files, but it's SLOOOOW. Worse, if I discover later that I omitted a relevant variable, then I'll have to run the whole script all over again.
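
One possible way to do this kind of chunked reading in base R is sketched below. It keeps a file connection open so each read.csv() call picks up where the previous one stopped, and uses colClasses = "NULL" to skip unwanted columns at read time. The demo file, the variable names in keep, and the chunk size are all hypothetical stand-ins.

```r
# Build a small demo file; in practice this would be the large input file.
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

keep <- c("mpg", "wt")   # hypothetical variables of interest
chunk_size <- 10         # rows per chunk; tune to the available RAM

con <- file(path, open = "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
# "NULL" tells read.csv to skip a column entirely; NA means "guess the type".
classes <- ifelse(header %in% keep, NA, "NULL")

n_rows <- 0
sum_mpg <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names = header, colClasses = classes),
    # read.csv raises an error once the file is exhausted; note that this
    # simple handler also hides any other read error.
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  n_rows <- n_rows + nrow(chunk)
  sum_mpg <- sum_mpg + sum(chunk$mpg)   # running total, one chunk at a time
}
close(con)

mean_mpg <- sum_mpg / n_rows
```

Larger chunks generally mean fewer read calls and less overhead, so raising chunk_size as far as memory allows is one way to speed up a loop of this kind.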

Any suggestions?

- Eric
