[R] aggregate slow with variables of type 'dates' - how to solve

Christoph Lehmann christoph.lehmann at gmx.ch
Sat Apr 16 01:22:34 CEST 2005


Dear all,

I use aggregate() with variables of type numeric and of type 'dates' (from
the chron package). For numeric variables, functions such as sum() are very
fast, but similarly simple functions such as min() are much slower for
variables of type 'dates'. The difference grows with the number of distinct
values of the 'id' variable; see this sample code:

library(chron)  # provides dates() and chron()
dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
               "02/28/92", "02/01/92"))
ntimes <- 700000
dts <- data.frame(rep(c(1:40), ntimes/8), 
                  chron(rep(dts, ntimes), format = c(dates = "m/d/y")),
                  rep(c(0.123, 0.245, 0.423, 0.634, 0.256), ntimes))
names(dts) <- c("id", "date", "tbs")


date()
dat.1st <- aggregate(dts$date, list(id = dts$id), min)$x
dat.1st <- chron(dat.1st, format = c(dates = "m/d/y"))     
dat.1st
date() #82 seconds


date()
tbs.s <- aggregate(as.numeric(dts$tbs),list(id = dts$id), sum)
tbs.s
date() #17 seconds
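
The date() calls above only bracket each step with wall-clock timestamps. As
a minimal sketch of a more direct measurement, base R's system.time() can
wrap the call itself. The data frame `toy` below is a hypothetical small
stand-in, not the data from this post, so its timing says nothing about the
82 and 17 seconds reported here.

```r
# Hypothetical toy data (assumption): 40 ids with plain numeric values.
set.seed(1)
toy <- data.frame(id = rep(1:40, 1000), val = runif(40000))

# system.time() evaluates the expression and reports how long it took;
# the assignment inside still creates 'res' in the calling environment.
timing <- system.time(
  res <- aggregate(toy$val, list(id = toy$id), sum)
)

timing["elapsed"]  # wall-clock seconds for the aggregate() call
```

The same wrapper could be put around each of the two aggregate() calls to
compare them without reading off timestamps by hand.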

Is this a problem with the data type 'dates'? If so, is there a way around
it? For huge data sets this can become a real problem.

As mentioned, if the variable 'id' has e.g. just 5 levels, the two timings
are roughly the same, but with 40 different ids the difference is as large
as shown above.
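
One possible workaround, sketched under the assumption that the extra time
is spent handling the classed 'dates' object rather than in min() itself:
strip the class with as.numeric(), aggregate the plain day counts, and turn
the result back into dates with chron() at the end. The reduced row count
and the names `base.dts`, `n` and `first.num` are illustrative choices, not
part of the original code, and whether this closes the gap on the full data
set is untested here.

```r
library(chron)  # provides dates() and chron()

# Small stand-in for the data frame in this post (same columns, fewer rows).
base.dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
                    "02/28/92", "02/01/92"))
n <- 8
dts <- data.frame(id = rep(1:40, n))
dts$date <- chron(rep(as.numeric(base.dts), 8 * n),
                  format = c(dates = "m/d/y"))
dts$tbs <- rep(c(0.123, 0.245, 0.423, 0.634, 0.256), 8 * n)

# Take the per-id minimum on the plain numeric day counts ...
first.num <- tapply(as.numeric(dts$date), dts$id, min)

# ... and restore the 'dates' class only once, on the 40 results.
dat.1st <- chron(as.vector(first.num), format = c(dates = "m/d/y"))
dat.1st
```

tapply() is used here only because it returns a plain vector;
aggregate(as.numeric(dts$date), list(id = dts$id), min) follows the same
idea and keeps the data-frame output.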

Thanks a lot

Christoph
