# [R] Calculate daily means from 5-minute interval data

Jeff Newmiller jdnewm|| @end|ng |rom dcn@d@v|@@c@@u@
Sun Sep 5 20:04:43 CEST 2021

```This problem nearly always boils down to using meta knowledge about the file. Having informal TZ info in the file is very helpful, but PST is not necessarily a uniquely-defined time zone specification so you have to draw on information outside of the file to know that these codes correspond to -0800 etc. (e.g. CST could be China Standard Time or US Central Standard Time.) Thus, it is tough to make this into a broadly-useful function.

You can also construct the timezone column from knowledge about the location of interest and the monotonicity of the time data. https://jdnewmil.github.io/eci298sp2016/QuickHowtos1.html#handling-time-data ... but the answer to "easy" seems firmly in the eyes of the beholder.

On September 5, 2021 10:18:48 AM PDT, Bill Dunlap <williamwdunlap using gmail.com> wrote:
>What is the best way to read (from a text file) timestamps from the fall
>time change, where there are two 1:15am's?  E.g., here is an extract from a
>US Geological Survey web site giving data on the river through our county
>on 2020-11-01, when we changed from PDT to PST,
>https://nwis.waterdata.usgs.gov/wa/nwis/uv/?cb_00010=on&cb_00060=on&cb_00065=on&format=rdb&site_no=12200500&period=&begin_date=2020-11-01&end_date=2020-11-05
>.
>
>The timestamps include the date and time as well as PDT or PST.
>
>river <-
>c("datetime,tz,discharge,height,temp",
>  "2020-11-01 00:00,PDT,20500,16.44,9.3",
>  "2020-11-01 00:15,PDT,20500,16.44,9.3",
>  "2020-11-01 00:30,PDT,20500,16.43,9.3",
>  "2020-11-01 00:45,PDT,20400,16.40,9.3",
>  "2020-11-01 01:00,PDT,20400,16.40,9.3",
>  "2020-11-01 01:00,PST,20200,16.34,9.2",
>  "2020-11-01 01:15,PDT,20400,16.39,9.3",
>  "2020-11-01 01:15,PST,20200,16.34,9.2",
>  "2020-11-01 01:30,PDT,20300,16.37,9.2",
>  "2020-11-01 01:30,PST,20100,16.31,9.2",
>  "2020-11-01 01:45,PDT,20300,16.35,9.2",
>  "2020-11-01 01:45,PST,20100,16.29,9.2",
>  "2020-11-01 02:00,PST,20100,16.29,9.2",
>  "2020-11-01 02:15,PST,20000,16.27,9.1",
>  "2020-11-01 02:30,PST,20000,16.26,9.1"
>  )
>
>The entries are obviously not in time order.
>
>Is there a simple way to read the timedate and tz columns together?  One
>way is to use d\$tz to construct an offset that can be read with
>strptime's "%z".
>
>> d\$POSIXct <-
>as.POSIXct(paste(d\$datetime,ifelse(d\$tz=="PDT","-0700","-0800")),
>format="%Y-%m-%d %H:%M %z")
>> d
>           datetime  tz discharge height temp             POSIXct
>1  2020-11-01 00:00 PDT     20500  16.44  9.3 2020-11-01 00:00:00
>2  2020-11-01 00:15 PDT     20500  16.44  9.3 2020-11-01 00:15:00
>3  2020-11-01 00:30 PDT     20500  16.43  9.3 2020-11-01 00:30:00
>4  2020-11-01 00:45 PDT     20400  16.40  9.3 2020-11-01 00:45:00
>5  2020-11-01 01:00 PDT     20400  16.40  9.3 2020-11-01 01:00:00
>6  2020-11-01 01:00 PST     20200  16.34  9.2 2020-11-01 01:00:00
>7  2020-11-01 01:15 PDT     20400  16.39  9.3 2020-11-01 01:15:00
>8  2020-11-01 01:15 PST     20200  16.34  9.2 2020-11-01 01:15:00
>9  2020-11-01 01:30 PDT     20300  16.37  9.2 2020-11-01 01:30:00
>10 2020-11-01 01:30 PST     20100  16.31  9.2 2020-11-01 01:30:00
>11 2020-11-01 01:45 PDT     20300  16.35  9.2 2020-11-01 01:45:00
>12 2020-11-01 01:45 PST     20100  16.29  9.2 2020-11-01 01:45:00
>13 2020-11-01 02:00 PST     20100  16.29  9.2 2020-11-01 02:00:00
>14 2020-11-01 02:15 PST     20000  16.27  9.1 2020-11-01 02:15:00
>15 2020-11-01 02:30 PST     20000  16.26  9.1 2020-11-01 02:30:00
>> with(d[order(d\$POSIXct),], plot(temp)) # monotonic temperature
>
>-Bill
>
>
>On Thu, Sep 2, 2021 at 12:41 PM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
>wrote:
>
>> Regardless of whether you use the lower-level split function, or the
>> higher-level aggregate function, or the tidyverse group_by function, the
>> key is learning how to create the column that is the same for all records
>> corresponding to the time interval of interest.
>>
>> If you convert the sampdate to POSIXct, the tz IS important, because most
>> of us use local timezones that respect daylight savings time, and a naive
>> conversion of standard time will run into trouble if R is assuming daylight
>> savings time applies. The lubridate package gets around this by always
>> assuming UTC and giving you a function to "fix" the timezone after the
>> conversion. I prefer to always be specific about timezones, at least by
>> using so something like
>>
>>     Sys.setenv( TZ = "Etc/GMT+8" )
>>
>> which does not respect daylight savings.
>>
>> Regarding using character data for identifying the month, in order to have
>> clean plots of the data I prefer to use the trunc function but it returns a
>> POSIXlt so I convert it to POSIXct:
>>
>>     discharge\$sampmonthbegin <- as.POSIXct( trunc( discharge\$sampdate,
>> units = "months" ) )
>>
>> Then any of various ways can be used to aggregate the records by that
>> column.
>>
>> On September 2, 2021 12:10:15 PM PDT, Andrew Simmons <akwsimmo using gmail.com>
>> wrote:
>> >You could use 'split' to create a list of data frames, and then apply a
>> >function to each to get the means and sds.
>> >
>> >
>> >cols <- "cfs"  # add more as necessary
>> >S <- split(discharge[cols], format(discharge\$sampdate, format = "%Y-%m"))
>> >means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
>> >sds   <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm =
>> >TRUE)))
>> >
>> >On Thu, Sep 2, 2021 at 3:01 PM Rich Shepard <rshepard using appl-ecosys.com>
>> >wrote:
>> >
>> >> On Thu, 2 Sep 2021, Rich Shepard wrote:
>> >>
>> >> > If I correctly understand the output of as.POSIXlt each date and time
>> >> > element is separate, so input such as 2016-03-03 12:00 would now be
>> 2016
>> >> 03
>> >> > 03 12 00 (I've not read how the elements are separated). (The TZ is
>> not
>> >> > important because all data are either PST or PDT.)
>> >>
>> >> Using this script:
>> =
>> >> ',', stringsAsFactors = FALSE)
>> >> discharge\$sampdate <- as.POSIXlt(discharge\$sampdate, tz = "",
>> >>                                   format = '%Y-%m-%d %H:%M',
>> >>                                   optional = 'logical')
>> >> discharge\$cfs <- as.numeric(discharge\$cfs, length = 6)
>> >>
>> >> I get this result:
>> >>               sampdate    cfs
>> >> 1 2016-03-03 12:00:00 149000
>> >> 2 2016-03-03 12:10:00 150000
>> >> 3 2016-03-03 12:20:00 151000
>> >> 4 2016-03-03 12:30:00 156000
>> >> 5 2016-03-03 12:40:00 154000
>> >> 6 2016-03-03 12:50:00 150000
>> >>
>> >> I'm completely open to suggestions on using this output to calculate
>> >> monthly
>> >> means and sds.
>> >>
>> >> If dplyr:summarize() will do so please show me how to modify this
>> command:
>> >> disc_monthly <- ( discharge
>> >>          %>% group_by(sampdate)
>> >>          %>% summarize(exp_value = mean(cfs, na.rm = TRUE))
>> >> because it produces daily means, not monthly means.
>> >>
>> >> TIA,
>> >>
>> >> Rich
>> >>
>> >> ______________________________________________
>> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> >______________________________________________
>> >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >https://stat.ethz.ch/mailman/listinfo/r-help
>> http://www.R-project.org/posting-guide.html
>> >and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help