[R] Calculate daily means from 5-minute interval data

Richard O'Keefe raoknz at gmail.com
Mon Aug 30 12:34:19 CEST 2021

It is not clear to me to whom Jeff Newmiller's comment about periodicity
was addressed.  The original poster, for asking for daily summaries?
A summary of what I wrote:
- daily means and standard deviations are a very poor choice for river flow data
- if you insist on doing that anyway, no fancy packages are required: just
  reshape the data into a matrix where rows correspond to days, using matrix(),
  and summarise it using rowMeans() and apply(..., FUN=sd).
- but it is quite revealing to just plot the data using image(), which makes no
  assumptions about periodicity or anything else; it is just a way of wrapping 1D
  data to fill a 2D space while still having interpretable axes
- the river data I examined showed fairly boring time series interrupted by
substantial shocks (caused by rainfall in catchment areas).
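Assuming the series really is complete and evenly spaced (Jeff's caveat about missing timestamps applies), the recipe in the summary above is only a few lines of base R; here it is sketched on fake data, since the real series is not included in this thread:

```r
# Fake data standing in for a year of 5-minute flow readings.
samples.per.day <- 12 * 24            # 12 five-minute intervals/hour x 24 hours
x <- rnorm(samples.per.day * 365)

# One row per 24-hour period; this assumes no missing timestamps.
m <- matrix(x, ncol = samples.per.day, byrow = TRUE)

daily.mean <- rowMeans(m)             # faster than apply(m, 1, FUN = mean)
daily.sd   <- apply(m, 1, FUN = sd)   # there is no rowSds in base R

# Wrap the 1D series into 2D: one axis indexes days, the other time of day.
image(m)                              # with real (positive) flows, image(log(m))
```

With real flow data you would replace rnorm() with the imported series, and the image(log(m)) variant tames the long tail so that small events remain visible.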

New stuff...
The river data I looked at came from Environment Canterbury.
River flows there are driven by (a) snow-melt from the Southern Alps,
which *is* roughly periodic with a period of one year, and (b) rainfall
events which charge the upstream catchment areas, leading to a
rapid ramp up followed by a slower exponential-looking decay.
The (a) element happens to be invisible in the Environment Canterbury
data, as they only release the latest month of flow data.  The ratio
between low flows and high flows ranged from 2 to 10 in the data I
could get.

The (b) component is NOT periodic and is NOT aligned with days
and is NOT predictable and is extremely important.

Where you begin is not with R or a search for packages but with
the question "what is actually going on in the real world?  What
are the influences on river flow, are they natural (and which) or
human (and which)?"  It's going to matter a lot how much
irrigation water is drawn from a river, and that may be roughly
predictable.  If water is occasionally diverted into another river
for flood control, that's going to make a difference.  If there is a
dam, that's going to make a difference.  Rainfall and snowmelt
are going to be seasonal (in a hand-wavy sense) but differently so.

And there is an equally important question:  "Why am I doing this?
What do I want to see in the data that doesn't already leap to the
eye?  What is anyone going to DO differently if they see that?"
Are you interested in whether minimum flows are adequate for
irrigation or whether flood control systems are adequate for high
flows?

Thinking about the people who might read my report, if I were
tasked with analysing river data, I would want to analyse the
data and present the results in such a way that most of them
would say "Why did I need this guy?  It's so obvious!  I could
have done that!  (If I had ever thought of it.)"  But that is because
I am thinking of farmers and politicians who have other maddened
grizzly bears to stun (thanks, Terry Pratchett).  If writing for an
audience of hydrologists and statisticians, you would make
different choices.

Here's a little bit of insight from the physics.
Why is it that the spikes in the flows rise rapidly and fall slowly?
Because the fall is limited by the rate at which the river system
can carry water away, but the rate at which a storm can deliver
water to the river system is not.  Did I know this before looking
at the ECan data?  Well, I had *seen* rivers rising rapidly and
falling slowly, but I had never really thought about it.
But now that I have, it's *obvious*: you cannot
understand the river without understanding the weather that
the river is subject to.  Anyone who genuinely understands
hydrology is looking at me sadly and saying "Just now you
figured this out?  At your mother's knee you didn't learn this?"
But it has such repercussions.  It means you need data on
rainfall in the catchment areas.  (Which ECan, to their credit,
also provide.)  In an important sense, there is no right way to
analyse river flow data *on its own*.
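A toy sketch of that asymmetry, with numbers entirely made up and not fitted to any river: constant baseflow, a storm arriving at day 2, a six-hour linear ramp up, and then an exponential recession limited by how fast the system can carry the water away.

```r
# Toy hydrograph: rapid rise, slow exponential recession (all numbers invented).
t <- seq(0, 10, by = 1/12)                 # ten days in 2-hour steps
baseflow <- 20                             # arbitrary units
ramp  <- pmin(pmax(t - 2, 0) / 0.25, 1)    # storm at day 2, 6-hour rise
decay <- exp(-pmax(t - 2.25, 0) / 1.5)     # recession, 1.5-day time constant
flow  <- baseflow + 80 * ramp * decay
plot(t, flow, type = "l", xlab = "day", ylab = "flow")
```

The rise takes a quarter of a day; getting back to within sight of baseflow takes most of a week, which is exactly the shape that straddles day boundaries and gets smeared out by daily means.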

On Mon, 30 Aug 2021 at 14:47, Jeff Newmiller <jdnewmil using dcn.davis.ca.us> wrote:
>
> IMO assuming periodicity is a bad practice for this. Missing timestamps happen too, and there is no reason to build a broken analysis process.
>
> On August 29, 2021 7:09:01 PM PDT, Richard O'Keefe <raoknz using gmail.com> wrote:
> >Why would you need a package for this?
> >> samples.per.day <- 12*24
> >
> >That's 12 5-minute intervals per hour and 24 hours per day.
> >Generate some fake data.
> >
> >> x <- rnorm(samples.per.day * 365)
> >> length(x)
> >[1] 105120
> >
> >Reshape the fake data into a matrix where each row represents one
> >24-hour period.
> >
> >> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)
> >
> >Now we can summarise the rows any way we want.
> >The basic tool here is ?apply.
> >?rowMeans is said to be faster than using apply to calculate means,
> >so we'll use that.  There is no *rowSds so we have to use apply
> >for the standard deviation.  I use ?head because I don't want to
> >post tens of thousands of meaningless numbers.
> >
> >> head(rowMeans(m))
> >[1] -0.03510177  0.11817337  0.06725203 -0.03578195 -0.02448077 -0.03033692
> >> head(apply(m, 1, FUN=sd))
> >[1] 1.0017718 0.9922920 1.0100550 0.9956810 1.0077477 0.9833144
> >
> >Now whether this is a *sensible* way to summarise your flow data is a question
> >that a hydrologist would be better placed to answer.  I would have started with
> >> plot(density(x))
> >which I just did with some real river data (only a month of it, sigh).
> >Very long tail.
> >Even
> >> plot(density(log(r)))
> >shows a very long tail.  Time to plot the data against time.  Oh my!
> >All of the long tail came from a single event.
> >There's a period of low flow, then there's a big rainstorm and the
> >flow goes WAY up, then over about two days the flow subsides to a new
> >somewhat higher level.
> >
> >None of this is reflected in means or standard deviations.
> >This is *time series* data, and time series data of a fairly special kind.
> >
> >One thing that might be helpful with your data would simply be
> >> image(log(m))
> >For my one month sample, the spike showed up very clearly that way.
> >Because right now, your first task is to get an idea of what the data
> >look like, and means-and-standard-deviations won't really do that.
> >
> >Oh heck, here's another reason to go with image(log(m)).
> >With image(m) I just see the one big spike.
> >With image(log(m)), I can see that little spikes often start in the
> >afternoon of one day and continue into the morning of the next.
> >From daily means, it looks like two unusual, but not very
> >unusual, days.  From the image, it's clearly ONE rainfall event
> >that just happens to straddle a day boundary.
> >
> >This is all very basic stuff, which is really the point.  You want to use
> >elementary tools to look at the data before you reach for fancy ones.
> >
> >
> >On Mon, 30 Aug 2021 at 03:09, Rich Shepard <rshepard using appl-ecosys.com> wrote:
> >>
> >> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> >> from a USGS monitoring gauge recording values every 5 minutes. The data
> >> files contain 90K-93K lines and plotting all these data would produce a
> >> solid block of color.
> >>
> >> What I want are the daily means and standard deviation from these data.
> >>
> >> As an occasional R user (depending on project needs) I've no idea what
> >> packages could be applied to these data frames. There likely are multiple
> >> paths to extracting these daily values so summary statistics can be
> >> calculated and plotted. I'd appreciate suggestions on where to start to
> >> learn how I can do this.
> >>
> >> TIA,
> >>
> >> Rich
> >>
> >> ______________________________________________
> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> and provide commented, minimal, self-contained, reproducible code.
> >