[R] setting zeros for missing interval in data

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Tue Mar 1 06:49:19 CET 2022


In base R, I wish the question below had been explained better. It is nice that an example was given, albeit misleading for me.

The data shown is not flawed and has nothing inside it that reflects being missing as it first sounded like.

What sounds like it is missing is specific dates entirely. The column called Channel seems irrelevant as it is always 30. Rain fall is always 0.2 or 0.4. The YEAR is always 2021. So the ONLY interesting thing here seems to be TIMESTAMP.

But I am NOT convinced they are missing because the times are all over the place. I mean 10 PM and 5:40 PM  and 5:20 AM and so on. There are multiple rows for the same day.

Yes, there is no info for May 1 and May 6 and 7. I have no idea why but How and why are we supposed to guess that it means no rain versus some other reason? Towards the end, what I think is the real message is shown. The suggestion is there should be data for every five minute period interpolated here

Fair enough. Can I suggest that the data offered to us has the TIMESTAMP field as character, rather than some form of DATE/TIME that can be used in Python?

Converting it or extracting some info into temporary columns might be useful here.

You could then create some kind of data that loops over times starting with your start time, say midnight on the 1st and for every 5 minute interval makes a timestamp that looks like what you need and COMPARE to what is in the data shown. For any that are nor present, you can create a similar row with a zero in it for the RAINFALL field. There are oodles of ways to do that, including some more straightforward than others. Or, you may just make the sequence or all, and later in some kind of merge, only keep ones from the original data if there is a duplicate. Again, many ways, even in base R.

If my analysis is right, and clearly it may not be, a much better way to ask this question might be to say you have timestamped data about rainfall where the readings for every 5 minute interval with no rainfall have been omitted. How do you create records for all 5-minute intervals that are not present and merge that info with the records shown?

As a hint, you can make a sequence like below, with your own adjustments for starting and ending dates.

> seq(from=as.POSIXct("2020-05-01 00:00"),to=as.POSIXct("2020-05-01 01:00"),by="5 min")
 [1] "2020-05-01 00:00:00 EDT" "2020-05-01 00:05:00 EDT" "2020-05-01 00:10:00 EDT"
 [4] "2020-05-01 00:15:00 EDT" "2020-05-01 00:20:00 EDT" "2020-05-01 00:25:00 EDT"
 [7] "2020-05-01 00:30:00 EDT" "2020-05-01 00:35:00 EDT" "2020-05-01 00:40:00 EDT"
[10] "2020-05-01 00:45:00 EDT" "2020-05-01 00:50:00 EDT" "2020-05-01 00:55:00 EDT"
[13] "2020-05-01 01:00:00 EDT"

Of course you may want to know WHY you need the missing data interpolated. Some graphics programs, if properly supplied with actual dates, not character strings, may simply skip missing records and leave room between others. The missing ones might be treated as zero, depending what you are doing.


-----Original Message-----
From: Eliza Botto <eliza_botto using outlook.com>
To: R-help using r-project.org <R-help using r-project.org>
Sent: Mon, Feb 28, 2022 10:52 pm
Subject: [R] setting zeros for missing interval in data

[The data setting in the last email might be faulty]

Dear useRs,

I have the following dataset which represents rainfall data at a 5-minute interval from 1 May 2021 to 30 September 2021.

> dput(YY)

structure(list(CHANNEL = c(30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L), YEAR = c(2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L), TIMESTAMP = c("2021/05/02 10:00:00 PM",
"2021/05/02 10:55:00 PM", "2021/05/04 05:40:00 PM", "2021/05/04 06:50:00 PM",
"2021/05/05 03:05:00 AM", "2021/05/08 05:15:00 AM", "2021/05/08 05:20:00 AM",
"2021/05/08 05:30:00 AM", "2021/05/08 05:50:00 AM", "2021/05/08 06:05:00 AM",
"2021/05/08 07:15:00 AM", "2021/05/08 08:00:00 AM", "2021/05/08 08:05:00 AM",
"2021/05/08 08:15:00 AM", "2021/05/08 08:35:00 AM", "2021/05/08 08:50:00 AM",
"2021/05/08 09:05:00 AM", "2021/05/08 09:30:00 AM", "2021/05/08 09:45:00 AM",
"2021/05/08 09:55:00 AM", "2021/05/08 10:10:00 AM", "2021/05/08 10:20:00 AM",
"2021/05/08 10:40:00 AM", "2021/05/08 10:55:00 AM", "2021/05/08 11:15:00 AM",
"2021/05/08 11:25:00 AM", "2021/05/08 11:35:00 AM", "2021/05/08 11:45:00 AM",
"2021/05/08 11:50:00 AM", "2021/05/08 12:00:00 PM", "2021/05/08 12:05:00 PM",
"2021/05/08 12:15:00 PM", "2021/05/08 12:20:00 PM", "2021/05/08 12:30:00 PM",
"2021/05/08 12:35:00 PM", "2021/05/08 12:50:00 PM", "2021/05/08 01:35:00 PM",
"2021/05/08 01:50:00 PM", "2021/05/08 02:20:00 PM", "2021/05/08 02:30:00 PM",
"2021/05/08 02:35:00 PM", "2021/05/08 03:00:00 PM", "2021/05/08 03:35:00 PM",
"2021/05/08 03:45:00 PM", "2021/05/08 04:30:00 PM", "2021/05/08 04:40:00 PM",
"2021/05/08 04:55:00 PM", "2021/05/08 05:05:00 PM", "2021/05/08 05:20:00 PM",
"2021/05/08 07:25:00 PM", "2021/05/08 09:00:00 PM", "2021/05/08 09:25:00 PM",
"2021/05/08 09:50:00 PM", "2021/05/08 10:15:00 PM", "2021/05/08 10:40:00 PM",
"2021/05/08 11:35:00 PM", "2021/05/09 12:40:00 AM", "2021/05/09 01:10:00 AM",
"2021/05/09 02:10:00 AM", "2021/05/09 06:00:00 AM", "2021/05/09 02:40:00 PM",
"2021/05/09 02:45:00 PM", "2021/05/09 02:50:00 PM", "2021/05/09 02:55:00 PM",
"2021/05/09 03:00:00 PM", "2021/05/09 03:05:00 PM", "2021/05/09 03:10:00 PM",
"2021/05/09 03:15:00 PM", "2021/05/09 03:20:00 PM", "2021/05/09 03:25:00 PM",
"2021/05/09 03:30:00 PM", "2021/05/09 03:35:00 PM", "2021/05/09 03:40:00 PM",
"2021/05/09 03:45:00 PM", "2021/05/09 03:50:00 PM", "2021/05/09 03:55:00 PM",
"2021/05/09 04:00:00 PM", "2021/05/09 04:05:00 PM", "2021/05/09 04:10:00 PM",
"2021/05/09 04:15:00 PM", "2021/05/09 04:25:00 PM", "2021/05/09 04:30:00 PM",
"2021/05/09 04:35:00 PM", "2021/05/09 04:40:00 PM", "2021/05/09 04:45:00 PM",
"2021/05/09 04:50:00 PM", "2021/05/09 05:00:00 PM", "2021/05/09 05:05:00 PM",
"2021/05/09 05:10:00 PM", "2021/05/09 05:20:00 PM", "2021/05/09 05:25:00 PM",
"2021/05/09 05:35:00 PM", "2021/05/09 05:45:00 PM", "2021/05/09 05:50:00 PM",
"2021/05/09 06:00:00 PM", "2021/05/09 06:10:00 PM", "2021/05/09 06:20:00 PM",
"2021/05/09 06:30:00 PM", "2021/05/09 06:40:00 PM", "2021/05/09 06:50:00 PM"
), RAINFALL = c(0.2, 0.2, 0.4, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2
)), row.names = c(276L, 286L, 599L, 773L, 829L, 951L, 955L, 971L,
996L, 1014L, 1123L, 1242L, 1260L, 1301L, 1378L, 1422L, 1456L,
1487L, 1504L, 1515L, 1539L, 1557L, 1597L, 1629L, 1679L, 1708L,
1728L, 1757L, 1775L, 1803L, 1818L, 1846L, 1859L, 1882L, 1892L,
1917L, 1983L, 2007L, 2050L, 2066L, 2077L, 2124L, 2190L, 2207L,
2288L, 2309L, 2334L, 2351L, 2374L, 2518L, 2588L, 2600L, 2616L,
2627L, 2639L, 2655L, 2674L, 2684L, 2725L, 2967L, 3826L, 3830L,
3832L, 3838L, 3842L, 3845L, 3846L, 3851L, 3854L, 3856L, 3861L,
3865L, 3868L, 3871L, 3873L, 3877L, 3880L, 3881L, 3885L, 3888L,
3890L, 3893L, 3897L, 3899L, 3900L, 3902L, 3906L, 3907L, 3910L,
3914L, 3915L, 3917L, 3920L, 3922L, 3923L, 3926L, 3928L, 3931L,
3932L, 3933L), class = "data.frame")

You could clearly see that there are some intervals which are missing from this dataset. For example, the data values for 1st of May are missing. Similarly,

between

30 2021 2021/05/02 10:00:00 PM      0.2

and

30 2021 2021/05/02 10:55:00 PM      0.2

the values of rainfall depth for following "time stamps" are missing because they were "zero"

30 2021 2021/05/02 10:05:00 PM      0.0

30 2021 2021/05/02 10:10:00 PM      0.0

30 2021 2021/05/02 10:15:00 PM      0.0

30 2021 2021/05/02 10:20:00 PM      0.0

30 2021 2021/05/02 10:25:00 PM      0.0

30 2021 2021/05/02 10:30:00 PM      0.0

30 2021 2021/05/02 10:35:00 PM      0.0

30 2021 2021/05/02 10:40:00 PM      0.0

30 2021 2021/05/02 10:45:00 PM      0.0

30 2021 2021/05/02 10:50:00 PM      0.0

So, what I want is a uniform list starting from 2021/05/01 to 2021/09/30 at every 5-minute intervals with "zero" values for the missing intervals in the original data list. I hope my question is clear.

Thank You very much in advance,

Eliza

[https://ipmcdn.avast.com/images/icons/icon-envelope-tick-green-avg-v1.png]<http://www.avg.com/email-signature?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail>     Virus-free. www.avg.com<http://www.avg.com/email-signature?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail>

    [[alternative HTML version deleted]]

______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list