[R] large dataset import, aggregation and reshape

Christoph Lehmann christoph.lehmann at gmx.ch
Sun Apr 24 21:17:40 CEST 2005


Dear useRs

We have a comma-delimited data set with 12 million rows and 5 columns that
matter here (the file in fact has many more): id, factor 'a' (5 levels),
factor 'b' (15 levels), a date stamp, and a numeric measurement. We run R on
SuSE Linux 9.1 with 2 GB of RAM (and a 3.5 GB swap file).

On average we have about 30 observations per id. We want to aggregate (e.g.
the sum of the measurements within each level of factor 'a', and likewise for
factor 'b') and reshape the data so that each id ends up as a single row in
the final data.frame, i.e. roughly 400,000 rows in total.
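
To make the intended result concrete, here is a toy version with made-up data
(column names, level names and numbers are purely illustrative):

set.seed(1)
d <- data.frame(id = rep(1:3, each = 4),
                a  = factor(sample(c("a1", "a2"), 12, replace = TRUE)),
                b  = factor(sample(c("b1", "b2", "b3"), 12, replace = TRUE)),
                x  = rnorm(12))
## sum of x per id and level of 'a', then one row per id
agg.a  <- aggregate(d$x, by = list(id = d$id, a = d$a), FUN = sum)
wide.a <- reshape(agg.a, idvar = "id", timevar = "a", direction = "wide")
wide.a   # one row per id, one column "x.<level>" per level of 'a'

The same would then be done for factor 'b', and the two wide tables merged by id.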

I tried read.delim, used the nrows argument, and defined colClasses (with a
'Date' class for the date column); we run into memory problems at the latest
when calling reshape and aggregate. Importing the date column as character
and then converting it with as.Date did not succeed either.
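
What we tried looks roughly like this (file name and date format are
placeholders, the real call differs in details):

cc <- c("integer",     # id
        "factor",      # factor 'a'
        "factor",      # factor 'b'
        "character",   # date stamp ("Date" directly was the other attempt)
        "numeric")     # measurement
d <- read.delim("data.csv", sep = ",", header = TRUE,
                nrows = 12e6, colClasses = cc)
d$date <- as.Date(d$date, format = "%Y-%m-%d")  # this step alone needs a lot of memory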

It seems the problematic, memory-intensive parts are:
a) importing the huge data per se (although a data.frame of dim c(12e6, 5)
should be well below 2 GB? see the rough estimate after this list)
b) converting the date stamp to a 'Date' class
c) the aggregate and reshape step
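
For a), a rough back-of-the-envelope calculation (my own numbers, just for
illustration): a column of 12e6 doubles takes about 12e6 * 8 bytes, so five
numeric-sized columns are only around half a gigabyte; the character/Date
columns and the temporary copies made by read.*, as.Date, aggregate and
reshape presumably account for the rest.

12e6 * 8 / 2^20       # ~ 92 MB per numeric column
5 * 12e6 * 8 / 2^20   # ~458 MB for five such columns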

What are the steps you would recommend?

(i) using scan instead of read.delim (with or without colClasses?)
(ii) importing the data in blocks (e.g. 1 million lines at a time),
aggregating each block, then importing the next block, and so on (a rough
sketch of this idea follows after the list)
(iii) putting the data into a MySQL database, importing from there, and doing
the reshape and aggregation in R for each factor separately
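
For (ii), a rough sketch of what I have in mind (file name, column names,
block size and the assumption that the file contains exactly these five
columns are all placeholders; untested on the real data):

block <- 1e6
con <- file("data.csv", open = "r")
readLines(con, n = 1)                        # skip the header line
res <- NULL
repeat {
    d <- tryCatch(read.table(con, sep = ",", nrows = block,
                             col.names  = c("id", "a", "b", "date", "x"),
                             colClasses = c("integer", "character", "character",
                                            "character", "numeric")),
                  error = function(e) NULL)  # NULL once the file is exhausted
    if (is.null(d)) break
    part <- aggregate(d$x, by = list(id = d$id, a = d$a), FUN = sum)
    res  <- rbind(res, part)
    if (nrow(d) < block) break               # last (partial) block
}
close(con)
## ids whose rows were split across two blocks occur twice, so aggregate once more
res <- aggregate(res$x, by = list(id = res$id, a = res$a), FUN = sum)

The result (at most 400,000 ids times 5 levels of 'a') should be small enough
to reshape comfortably; the same loop run for factor 'b' gives the second table.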

thanks for any hints from your valuable experience
cheers
christoph



