[R] large dataset import, aggregation and reshape

Renaud Lancelot renaud.lancelot at cirad.fr
Mon Apr 25 00:28:21 CEST 2005


Christoph Lehmann a écrit :
> Dear useRs
> 
> We have a data-set (comma delimited) with 12Millions of rows, and 5 
> columns (in fact many more, but we need only 4 of them): id, factor 'a' 
> (5 levels), factor 'b' (15 levels), date-stamp, numeric measurement. We 
> run R on suse-linux 9.1 with 2GB RAM, (and a 3.5GB swap file).
> 
> on average we have 30 obs. per id. We want to aggregate (eg. sum of the 
> measuresments under each factor-level of 'a' and the same for factor 
> 'b') and reshape the data so that for each id we have only one row in 
> the final data.frame, means finally we have roughly 400000 lines.
> 
> I tried read.delim, used the nrows argument, defined colClasses (with an 
> as.Date class) - memory problems at the latests when calling reshape and 
> aggregate. Also importing the date column as character and then 
> converting the dates column using 'as.Date' didn't succeed.
> 
> It seems the problematic, memory intesive parts are:
> a) importing the huge data per se (but the data with dim c(12,5) << 2GB?)
> b) converting the time-stamp to a 'Date' class
> c) aggregate and reshape task
> 
> What are the steps you would recommend?
> 
> (i) using scan, instead of read.delim (with or without colClasses?)
> (ii) importing blocks of data (eg 1Million lines once), aggregating 
> them, importing the next block, so on?
> (iii) putting the data into a MySQL database, importing from there and 
> doing the reshape and aggregation in R for both factors separately
> 
> thanks for hints from your valuable experience
> cheers
> christoph
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 

I would try the latter and use and SQL interface such as RODBC or 
RMySQL. You can send your aggregation and reshape commands to the 
external database as an SQL query.

Example with a database I have at hand. The table "datemesu" has 640,000 
rows and 5 columns, the field "mesure" being a factor with 2 levels, "N" 
and "P".

 > library(RODBC)
 > fil <- "C:/Archives/Baobab/Baobab2000.mdb"
 > chann <- odbcConnectAccess(fil)
 > quer <- paste("SELECT numani, SUM(IIF(mesure = 'P', 1, 0)) AS wt,",
+                      "SUM(IIF(mesure = 'N', 1, 0)) AS bcs,",
+                      "MIN(date) AS minDate",
+               "FROM datemesu",
+               "GROUP BY numani")
 > system.time(tab <- sqlQuery(chann, quer), gcFirst = TRUE)
[1] 11.16  0.19 11.54    NA    NA
 > odbcCloseAll()
 >
 > dim(tab)
[1] 69360     4
 > head(tab)
        numani wt bcs    minDate
1 SNFLCA00001  1   0 1987-01-23
2 SNFLCA00002  2   0 1987-01-10
3 SNFLCA00004  1   0 1987-01-10
4 SNFLCA00006  4   0 1987-02-02
5 SNFLCA00007  4   0 1987-02-18
6 SNFLCA00008  3   0 1987-01-09


Best,

Renaud


-- 
Dr Renaud Lancelot, vétérinaire
C/0 Ambassade de France - SCAC
BP 834 Antananarivo 101 - Madagascar

e-mail: renaud.lancelot at cirad.fr
tel.:   +261 32 40 165 53 (cell)
         +261 20 22 665 36 ext. 225 (work)
         +261 20 22 494 37 (home)




More information about the R-help mailing list