[R] the large dataset problem

Bernzweig, Bruce (Consultant) bbernzwe at bear.com
Mon Jul 30 17:46:10 CEST 2007


Hi Eric,

I'm facing a similar problem.

Looking over the list of packages I came across:

 	R.huge: Methods for accessing huge amounts of data 
 	http://cran.r-project.org/src/contrib/Descriptions/R.huge.html

I haven't installed it yet, so I don't know how well it works. I
probably won't have time to look at it until next week at the earliest.

I'd be interested in hearing your feedback if you do try it.

- Bruce

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Eric Doviak
Sent: Saturday, July 28, 2007 2:08 PM
To: r-help at stat.math.ethz.ch
Subject: [R] the large dataset problem

Dear useRs,

I recently began a job at a very large and heavily bureaucratic
organization. We're setting up a research office, and statistical
analysis will form the backbone of our work. We'll be working with large
datasets such as the SIPP as well as our own administrative data.

Due to the bureaucracy, it will take some time to get the licenses for
proprietary software like Stata. Right now, R is the only statistical
software package on my computer. 

This, of course, is a huge limitation, because R loads data directly
into RAM, making it difficult (if not impossible) to work with large
datasets. My computer has only 1000 MB of RAM, of which Microsoft
Windows devours 400 MB. To make my memory issues even worse, my computer
has a virus scanner that runs every day, and I do not have the
administrative rights to turn the damn thing off.

I need to find some way to overcome these constraints and work with
large datasets. Does anyone have any suggestions?

I've read that I should "carefully vectorize my code." What does that
mean?
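From what I've gathered, "vectorizing" means replacing explicit loops with
operations on whole vectors at once, which run in compiled code rather than
the R interpreter. Is this roughly what's meant?

```r
# Loop version: fills the result one element at a time in interpreted R
x <- 1:100000
sq_loop <- numeric(length(x))
for (i in seq_along(x)) sq_loop[i] <- x[i]^2

# Vectorized version: one operation applied to the whole vector at once
sq_vec <- x^2

identical(sq_loop, sq_vec)  # TRUE
```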

The "Introduction to R" manual suggests modifying input files with Perl.
Any tips on how to get started? Would the Perl Data Language (PDL) be a
good choice?  http://pdl.perl.org/index_en.html
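One base-R alternative I've seen mentioned is the colClasses argument to
read.table()/read.csv(), which can drop unwanted columns at read time so
they never occupy RAM. Would that be a reasonable substitute for Perl
preprocessing? A sketch with synthetic stand-in data (column and file
names here are made up):

```r
# Synthetic stand-in for a wide raw file; only columns "a" and "d" are wanted.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:5, b = letters[1:5], c = rnorm(5), d = 6:10),
          tmp, row.names = FALSE)

# "NULL" drops a column at read time; NA lets R guess the type of a kept column.
wanted <- c(a = NA, b = "NULL", c = "NULL", d = NA)
dat <- read.csv(tmp, colClasses = wanted)
names(dat)  # "a" "d"
```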

I wrote a script that loads large datasets a few lines at a time, writes
the dozen or so variables of interest to a CSV file, removes the loaded
data and then (via a "for" loop) loads the next few lines. I managed to
get it to work with one of the SIPP core files, but it's painfully slow.
Worse, if I discover later that I omitted a relevant variable, I'll have
to run the whole script all over again.
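Roughly, the script does something like the sketch below (simplified,
with synthetic data standing in for a SIPP core file). I read over an
open connection so each read resumes where the previous one stopped,
rather than re-skipping earlier lines on every iteration:

```r
# Synthetic stand-in for a large raw file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25000, x = rnorm(25000)), tmp, row.names = FALSE)

con <- file(tmp, open = "r")
invisible(readLines(con, n = 1))          # discard the header line
total <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 10000, col.names = c("id", "x")),
    error = function(e) NULL)             # NULL once the file is exhausted
  if (is.null(chunk)) break
  # ... write the variables of interest out here, then drop `chunk` ...
  total <- total + nrow(chunk)
}
close(con)
total  # 25000
```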

Any suggestions?

Thanks,
- Eric

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





