[R] Selecting ranges of dates from a dataframe

David Winsemius dwinsemius at comcast.net
Fri Mar 11 14:58:52 CET 2011


On Mar 11, 2011, at 8:41 AM, Benjamin Stier wrote:

> Hi Francisco,
>
> Thanks for your solution. It runs pretty fast compared to my for loop.
> Here is a comparison of system.time():
>
> system.time(splitVals <- by(serv, dates, aggregateDf ))
>   user  system elapsed
>  1.129   0.218   1.348
>
> system.time(... my long for loop...)
>   user  system elapsed
> 276.987   1.544 278.698
>
>
> I also tried David's solution with "aggregate", but I can't get it to
> work because I have to add as.numeric() into the sum(), since the data
> is very big.

This comment doesn't make any sense to me. Unless you have character
vectors that need coercion because of malformed values (which was NOT
part of the example posed), `sum` should not need any pre-processing or
post-processing with `as.numeric`.

 > serv <- read.delim("cut.inp")
 > serv$datum <- strptime(serv$datum,  "%Y-%m-%d %H:%M:%S")
 > dates.serv <- unique(strptime(serv$datum, format="%Y-%m-%d"))
 > aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
      Group.1    read    write
1 2011-01-29 1021439 11726356
2 2011-01-30 1089534  4634910
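
One case where `as.numeric` inside `sum` does make a difference, although it was not part of the example posed and it is only a guess that it applies to your data: if the columns are stored as integers and a total exceeds `.Machine$integer.max`, `sum` returns NA with a warning. A minimal sketch with made-up values:

```r
# Made-up values (not from cut.inp): two integers whose total exceeds
# .Machine$integer.max, the largest value an R integer can hold.
big <- c(.Machine$integer.max, 1L)

# Summing integers gives an integer result, so this overflows to NA
# (R also issues an "integer overflow" warning, suppressed here).
int_total <- suppressWarnings(sum(big))

# Converting to double first avoids the overflow.
num_total <- sum(as.numeric(big))

print(is.na(int_total))
print(num_total)
```

If that is what happened, storing the columns as numeric when the file is read removes the need for `as.numeric` later.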

Perhaps what you really needed was to read the file with colClasses to  
define the date-time and numeric fields properly. Try this:

serv <- read.delim("cut.inp",
                   colClasses=c("POSIXct", "integer", "integer",
                                "numeric", "numeric"))
aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
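
Since I don't have your full file, here is a self-contained sketch of the same idea using a few made-up rows. The column names `id` and `count` and all of the values are assumptions for illustration; only `datum`, `read` and `write` come from your example.

```r
# Made-up tab-separated sample standing in for "cut.inp".
# Columns "id" and "count" are hypothetical placeholders.
txt <- paste(
  "datum\tid\tcount\tread\twrite",
  "2011-01-29 10:00:00\t1\t5\t100\t1000",
  "2011-01-29 23:30:00\t2\t7\t200\t2000",
  "2011-01-30 08:15:00\t3\t9\t300\t3000",
  sep = "\n")

# colClasses fixes the type of each column at read time.
serv <- read.delim(text = txt,
                   colClasses = c("POSIXct", "integer", "integer",
                                  "numeric", "numeric"))

# Group by calendar day and sum both columns.
daily <- aggregate(serv[, c("read", "write")],
                   list(day = format(serv$datum, "%Y-%m-%d")), sum)
print(daily)
```

With `read` and `write` declared as "numeric", `sum` works in double precision and no `as.numeric` call is needed.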





> I will now try to understand how the by() function works and what it
> does.
> Thanks again for helping me!

If you read the help(tapply) page you are told that both `by` and  
`aggregate` are just convenience functions using tapply "under the  
hood".
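
To illustrate with a few made-up rows (the values are assumptions, not your data), calling `tapply` directly gives the same daily totals as `aggregate`:

```r
# Made-up sample; only the column names come from the example above.
serv <- data.frame(
  datum = as.POSIXct(c("2011-01-29 10:00:00", "2011-01-29 23:30:00",
                       "2011-01-30 08:15:00")),
  read  = c(100, 200, 300),
  write = c(1000, 2000, 3000))

days <- format(serv$datum, "%Y-%m-%d")

# tapply: one column at a time, returns a 1-d array named by day.
read_by_day <- tapply(serv$read, days, sum)

# aggregate: several columns at once, returns a data.frame.
agg <- aggregate(serv[, c("read", "write")], list(day = days), sum)

print(read_by_day)
print(agg)
```

The main practical difference is the shape of the result: `tapply` handles a single vector, while `aggregate` and `by` accept a whole data.frame.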

>
> Regards,
>
> Benjamin
>
>
> On Thu, Mar 10, 2011 at 04:26:57PM +0000, Francisco Gochez wrote:
>> Benjamin,
>>
>> A more elegant "R-style" solution would be to use one of R's
>> "apply"/aggregation routines, of which there are many. For example,
>> the "by" function can split a data.frame by some factor/categorical
>> variable(s), and then apply a function to each "slice". The result
>> can then be pieced back together. See below for an example in which
>> this factor is simply a parallel vector of pure dates:
>>
>> # extract pure date component of time and date
>> dates <- format(serv$datum, "%Y-%m-%d")
>>
>> # write auxiliary function to aggregate a "slice" of the data.frame
>> # x will be a "slice" of data from a single day
>> aggregateDf <- function(x)
>> {
>>     # return a one-row data.frame
>>     data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
>>                write = sum(x$write),
>>                read = sum(x$read))
>> }
>>
>> # now process each "slice" of the serv data.frame using "by"
>> splitVals <- by(serv, dates, aggregateDf )
>>
>> # bind back into a single data.frame
>> values <- do.call(rbind, splitVals)
>>
>>
>> The difference in execution speed is pretty negligible on my machine,
>> so it's a more concise solution but I don't know if it is much faster.
>>
>> HTH,
>>
>> Francisco
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT


