[R] the large dataset problem

Mon Jul 30 18:42:59 CEST 2007

Eric Doviak <edoviak <at> earthlink.net> writes:

> 
> Dear useRs,
> 
> I recently began a job at a very large and heavily bureaucratic organization.
> We're setting up a research office and statistical analysis will form the
> backbone of our work. We'll be working with large datasets such as the SIPP
> as well as our own administrative data.

  We need to know more about what you need to do with those
large datasets in order to help -- some specific examples
would be useful.  In many situations you can set up a database
connection, or use Perl, to select carefully and load only the
observations/variables you need into R, but it's hard to make
completely general suggestions.
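
  As one possible illustration of loading only what you need, here
is a minimal base-R sketch (no database or Perl involved): read.csv()
skips any column whose colClasses entry is "NULL", and nrows caps the
number of observations read.  The file and its column names (id,
income, notes) are made up for the example and are not from your data.

```r
# Hypothetical example file; the columns are invented for illustration.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id     = 1:5,
                     income = c(10, 20, NA, 40, 50),
                     notes  = letters[1:5]),
          tmp, row.names = FALSE)

# colClasses = "NULL" makes read.csv() skip that column entirely,
# so 'notes' is never held in memory.
small <- read.csv(tmp, colClasses = c("integer", "numeric", "NULL"))

# nrows limits how many observations are read.
first3 <- read.csv(tmp, colClasses = c("integer", "numeric", "NULL"),
                   nrows = 3)

str(small)
str(first3)
```

  Whether this is enough depends on the size and format of the real
files; for selecting rows by condition, a database query is usually
the better fit.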

  I'm not sure what purpose is served by your code that reads a
few lines of a data file and writes them to a CSV file ... ?

  "Vectorizing" your code means figuring out a way to tell R
to do what you want as a single 'vector' operation -- for
example, to remove NAs from a vector you could do this:

# start empty and append each non-NA element one at a time
newvec = numeric(0)
for (i in seq_along(oldvec)) {
  if (!is.na(oldvec[i])) newvec = c(newvec, oldvec[i])
}

but this would be incredibly slow, because c() copies the whole
of newvec on every iteration --

newvec = oldvec[!is.na(oldvec)]

or

newvec = na.omit(oldvec)

would be far faster.
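
  To see the difference on your own machine, something like the
following sketch times the loop against the vectorized version and
checks that they agree.  The test vector is simulated (its size and
the number of NAs are arbitrary choices), and the timings will vary
by machine, so none are quoted here.

```r
# Simulated data: 20000 values, 2000 of them set to NA (arbitrary sizes).
set.seed(1)
oldvec <- rnorm(20000)
oldvec[sample(length(oldvec), 2000)] <- NA

# Loop version: grows the result one element at a time.
loop_time <- system.time({
  newvec_loop <- numeric(0)
  for (i in seq_along(oldvec)) {
    if (!is.na(oldvec[i])) newvec_loop <- c(newvec_loop, oldvec[i])
  }
})

# Vectorized version: one logical-index operation.
vec_time <- system.time(newvec_vec <- oldvec[!is.na(oldvec)])

# na.omit() gives the same values, plus an "na.action" attribute
# recording which positions were dropped.
newvec_omit <- na.omit(oldvec)

print(loop_time)
print(vec_time)
print(identical(newvec_loop, newvec_vec))
print(identical(as.vector(newvec_omit), newvec_vec))
```

  Note that na.omit() attaches the extra attribute, so use
as.vector() (as above) if you need a plain vector afterwards.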