[R] Selecting ranges of dates from a dataframe

Fri Mar 11 14:41:12 CET 2011

Hi Francisco,

Thanks for your solution. It runs pretty fast compared to my for loop. Here
is a comparison of system.time():

system.time(splitVals <- by(serv, dates, aggregateDf ))
   user  system elapsed 
  1.129   0.218   1.348 

system.time(... my long for loop...)
   user  system elapsed 
276.987   1.544 278.698

I also tried David's solution with "aggregate", but I can't get it to work
because I have to put as.numeric() inside the sum(), since the data is very big.
I will now try to understand how the by()-function works and what it does.
Thanks again for helping me!
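
For reference, here is a minimal sketch of how the as.numeric() conversion
might be combined with aggregate(). The toy data frame is made up for
illustration; only the column names datum, write and read come from this
thread. One plausible reason the conversion is needed is that integer sums in
R overflow to NA (with a warning) once they exceed .Machine$integer.max, so
the sketch converts to double inside the summary function.

```r
# Toy stand-in for serv (hypothetical values; column names from the thread)
serv <- data.frame(
  datum = as.POSIXct(c("2011-03-09 10:00:00", "2011-03-09 18:00:00",
                       "2011-03-10 09:30:00"), tz = "UTC"),
  write = c(2000000000L, 2000000000L, 5L),
  read  = c(1L, 2L, 3L)
)

# Pure date component, used as the grouping variable
day <- format(serv$datum, "%Y-%m-%d")

# Convert to double inside the summary function so that large
# integer sums do not overflow to NA
values <- aggregate(serv[c("write", "read")],
                    by = list(datum = day),
                    FUN = function(v) sum(as.numeric(v)))
```

With this toy input, values has one row per day, and the write total for the
first day is larger than an integer column could hold.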

Regards,

Benjamin

On Thu, Mar 10, 2011 at 04:26:57PM +0000, Francisco Gochez wrote:
> Benjamin,
> 
> A more elegant "R-style" solution would be to use one of R's "apply"/
> aggregation routines, of which there are many. For example, the "by" function
> can split a data.frame by some factor/categorical variable(s), and then apply a
> function to each "slice".  The result can then be pieced back together.  See
> below for an example in which this factor is simply a parallel vector of pure
> dates:
> 
> # extract pure date component of time and date
> dates <- format(serv$datum, "%Y-%m-%d")
> 
> # write auxiliary function to aggregate a "slice" of the data.frame
> # x will be a "slice" of data from a single day
> aggregateDf <- function(x)
> {
>     # return a one-row data.frame
>     data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
>                write = sum(x$write),
>                read = sum(x$read))
> }
> 
> # now process each "slice" of the serv data.frame using "by"
> splitVals <- by(serv, dates, aggregateDf )
> 
> # bind back into a single data.frame
> values <- do.call(rbind, splitVals)
> 
> 
> The difference in execution speed is pretty negligible on my machine, so it's a
> more concise solution but I don't know if it is much faster.
> 
> HTH,
> 
> Francisco
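
As a small self-contained illustration of what by() does in the approach
quoted above, the sketch below runs it on a made-up three-row data frame. The
values, the extra rows column and the names perDay and values are assumptions
for illustration only; the column names datum, write and read are taken from
the thread.

```r
# Toy stand-in for serv (hypothetical values; column names from the thread)
serv <- data.frame(
  datum = as.POSIXct(c("2011-03-09 10:00:00", "2011-03-09 18:00:00",
                       "2011-03-10 09:30:00"), tz = "UTC"),
  write = c(10, 20, 5),
  read  = c(1, 2, 3)
)

# Parallel vector of pure dates, one entry per row of serv
dates <- format(serv$datum, "%Y-%m-%d")

# by() passes the rows of each date to the function as a smaller
# data.frame and collects the results in an object of class "by",
# with one element per distinct date
perDay <- by(serv, dates, function(x) {
  data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
             write = sum(x$write),
             read  = sum(x$read),
             rows  = nrow(x))
})

# Each element is a one-row data.frame; rbind stacks them
values <- do.call(rbind, perDay)
```

Inspecting perDay[[1]] before the rbind step shows the one-row data.frame
built from the first day's slice, which may help when working out what the
function receives and returns.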