[R] Calculate daily means from 5-minute interval data

Jeff Newmiller jdnewm|| @end|ng |rom dcn@d@v|@@c@@u@
Wed Sep 1 01:51:09 CEST 2021


Never use stringsAsFactors = TRUE on uncleaned data. For one thing, if you give a factor to as.numeric, it works from the integer level codes, not the character representation, which is why your six-digit cfs values come out as small numbers.
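
A minimal sketch of the factor pitfall, using a made-up three-value vector. The stray "n/a" entry is an assumption to show what uncleaned data can look like, not something known about the actual discharge file.

```r
# Hypothetical raw column: one non-numeric entry makes read.table treat
# the whole column as text (and as a factor if stringsAsFactors = TRUE)
raw <- c("136000", "126000", "n/a")
f <- factor(raw)

# as.numeric() on a factor returns the integer level codes, not the values
codes <- as.numeric(f)

# Going through character first recovers the numbers; "n/a" becomes NA
vals <- suppressWarnings(as.numeric(as.character(f)))

print(codes)
print(vals)
```

The level codes depend only on the sorted order of the distinct strings, so they carry no information about the magnitude of the original readings.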

library(dplyr)
dta <- read.csv( text =
"sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,130000
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,120000
", stringsAsFactors = FALSE)
# One row per date: mean discharge and number of readings that day
dtad <- (   dta
        %>% group_by( sampdate )
        %>% summarise( exp_value = mean(cfs, na.rm = TRUE)
                     , Count = n()
                     )
        )
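
The same grouping idea can also be sketched in base R with aggregate(), and it extends to monthly means by building a grouping column with one repeated value per month. The rows below are a small made-up data set in the same format as the thread's sample, and the column name yearmon is hypothetical.

```r
# Made-up rows in the same sampdate,samptime,cfs layout
dta <- read.csv(text =
"sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-27,09:30,130000
2020-09-01,09:30,120000
", stringsAsFactors = FALSE)

dta$sampdate <- as.Date(dta$sampdate)

# Grouping column with one repeated value per month, e.g. "2020-08"
dta$yearmon <- format(dta$sampdate, "%Y-%m")

# Mean discharge per day and per month
daily   <- aggregate(cfs ~ sampdate, data = dta, FUN = mean)
monthly <- aggregate(cfs ~ yearmon,  data = dta, FUN = mean)

print(daily)
print(monthly)
```

With the formula interface, aggregate() drops rows with missing values by default, so no separate na.rm argument is passed here.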

On August 31, 2021 2:11:05 PM PDT, Rich Shepard <rshepard using appl-ecosys.com> wrote:
>On Sun, 29 Aug 2021, Jeff Newmiller wrote:
>
>> The general idea is to create a "grouping" column with repeated values for
>> each day, and then to use aggregate to compute your combined results. The
>> dplyr package's group_by/summarise functions can also do this, and there
>> are also proponents of the data.table package which is high performance
>> but tends to depend on altering data in-place unlike most other R data
>> handling functions.
>
>Jeff,
>
>I've read a number of docs discussing dplyr's summarize and group_by
>functions (including that section of Hadley's 'R for Data Science' book), yet
>I'm missing something; I think that I need to separate the single sampdate
>column into columns for year, month, and day and group_by year/month
>summarizing within those groups.
>
>The data are of this format:
>sampdate,samptime,cfs
>2020-08-26,09:30,136000
>2020-08-26,09:35,126000
>2020-08-26,09:40,130000
>2020-08-26,09:45,128000
>2020-08-26,09:50,126000
>2020-08-26,09:55,125000
>2020-08-26,10:00,121000
>2020-08-26,10:05,117000
>2020-08-26,10:10,120000
>
>My current script is:
>
>-------8<--------------
>library('tidyverse')
>
>discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',', stringsAsFactors = TRUE)
>discharge$sampdate <- as.Date(discharge$sampdate)
>discharge$cfs <- as.numeric(discharge$cfs, length = 6)
>
># use dplyr.summarize grouped by date
>
># need to separate sampdate into %Y-%M-%D in order to group_by the month?
>by_month <- discharge %>%
>   group_by(sampdate ...
>summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs))
>---------------->8--------
>
>and the results are:
>
>> str(discharge)
>'data.frame':	93254 obs. of  3 variables:
>  $ sampdate: Date, format: "2020-08-26" "2020-08-26" ...
>  $ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123 128 133 138 143 148 ...
>  $ cfs     : num  176 156 165 161 156 154 144 137 142 142 ...
>> ls()
>[1] "by_month"  "discharge"
>> by_month
># A tibble: 93,254 × 3
># Groups:   sampdate [322]
>    sampdate   samptime   cfs
>    <date>     <fct>    <dbl>
>  1 2020-08-26 09:30      176
>  2 2020-08-26 09:35      156
>  3 2020-08-26 09:40      165
>  4 2020-08-26 09:45      161
>  5 2020-08-26 09:50      156
>  6 2020-08-26 09:55      154
>  7 2020-08-26 10:00      144
>  8 2020-08-26 10:05      137
>  9 2020-08-26 10:10      142
>10 2020-08-26 10:15      142
># … with 93,244 more rows
>
>I don't know why the discharge values are truncated to 3 digits when they're
>6 digits in the input data.
>
>Suggested readings appreciated,
>
>Rich
