[R] How to select max data according to week?

Marc Schwartz m@rc_@chw@rtz @end|ng |rom me@com
Wed Jun 19 17:30:26 CEST 2019


Hi All,

Is there a reason that the following was not considered?

> DF
         Date O3_Conc
1  2000-01-01   0.033
2  2000-01-02   0.023
3  2000-01-03   0.025
4  2000-01-04   0.041
5  2000-01-05   0.063
6  2000-01-06   0.028
7  2000-01-07   0.068
8  2000-01-08   0.048
9  2000-01-09   0.037
10 2000-01-10   0.042
11 2000-01-11   0.027
12 2000-01-12   0.035
13 2000-01-13   0.063
14 2000-01-14   0.035
15 2000-01-15   0.042
16 2000-01-16   0.028

> str(DF)
'data.frame':	16 obs. of  2 variables:
 $ Date   : Date, format: "2000-01-01" ...
 $ O3_Conc: num  0.033 0.023 0.025 0.041 0.063 0.028 0.068 0.048 0.037 0.042 ...


> aggregate(O3_Conc ~ cut(Date, breaks = "14 days"), data = DF, max)
  cut(Date, breaks = "14 days") O3_Conc
1                    2000-01-01   0.068
2                    2000-01-15   0.042


See ?cut.Date

Note that there is an error in Aisyah's original post, in that the value for the max concentration in the second 14 day window is 0.042, not 0.061.

Also, if one wanted to use actual calendar weeks, with a week starting on Sunday, as opposed to 14 day windows starting from the minimum date, you could use:

> aggregate(O3_Conc ~ cut(Date, breaks = "2 weeks", start.on.monday = FALSE), data = DF, max)
  cut(Date, breaks = "2 weeks", start.on.monday = FALSE) O3_Conc
1                                             1999-12-26   0.068
2                                             2000-01-09   0.063


If desired, you could pre-create a column in the DF with the result of cut():

DF$TimeStartDays <- cut(DF$Date, breaks = "14 days")
DF$TimeStartWks <- cut(DF$Date, breaks = "2 weeks", start.on.monday = FALSE)

> DF
         Date O3_Conc TimeStartDays TimeStartWks
1  2000-01-01   0.033    2000-01-01   1999-12-26
2  2000-01-02   0.023    2000-01-01   1999-12-26
3  2000-01-03   0.025    2000-01-01   1999-12-26
4  2000-01-04   0.041    2000-01-01   1999-12-26
5  2000-01-05   0.063    2000-01-01   1999-12-26
6  2000-01-06   0.028    2000-01-01   1999-12-26
7  2000-01-07   0.068    2000-01-01   1999-12-26
8  2000-01-08   0.048    2000-01-01   1999-12-26
9  2000-01-09   0.037    2000-01-01   2000-01-09
10 2000-01-10   0.042    2000-01-01   2000-01-09
11 2000-01-11   0.027    2000-01-01   2000-01-09
12 2000-01-12   0.035    2000-01-01   2000-01-09
13 2000-01-13   0.063    2000-01-01   2000-01-09
14 2000-01-14   0.035    2000-01-01   2000-01-09
15 2000-01-15   0.042    2000-01-15   2000-01-09
16 2000-01-16   0.028    2000-01-15   2000-01-09

and then use:

> aggregate(O3_Conc ~ TimeStartDays, data = DF, max)
  TimeStartDays O3_Conc
1    2000-01-01   0.068
2    2000-01-15   0.042

> aggregate(O3_Conc ~ TimeStartWks, data = DF, max)
  TimeStartWks O3_Conc
1   1999-12-26   0.068
2   2000-01-09   0.063


Regards,

Marc Schwartz



> On Jun 19, 2019, at 10:45 AM, Eric Berger <ericjberger using gmail.com> wrote:
> 
> I just realized that my method as written only works if the entire date
> range is within a single calendar year.
> Here is a revised version that should handle the general case.
> 
> In this version I assume that the conversion from the original numbers to
> dates has already been done
> e.g. it can be done using ymd() as I showed before
> 
> library(dplyr)
> library(lubridate)
> 
> ## create some example data for 36 months from 2010
> d <- seq(from=as.Date("2010-01-01"),to=as.Date("2012-12-31"),by="1 day")
> conc <- runif(length(d)) # concentrations
> df <- data.frame( dt=d, conc=conc)
> df2 <- dplyr::mutate( df, wkpair=as.integer(floor((0:(nrow(df)-1))/14)) )
> df3 <- dplyr::group_by(df2,wkpair) %>%
>       dplyr::summarise(from=min(dt), to=max(dt), maxconc=max(conc)) %>%
>       dplyr::select(from,to,maxconc)
> df3
> 
> # A tibble: 79 x 3
>   from       to         maxconc
>   <date>     <date>       <dbl>
> 1 2010-01-01 2010-01-14   0.928
> 2 2010-01-15 2010-01-28   0.834
> 3 2010-01-29 2010-02-11   0.989
> 4 2010-02-12 2010-02-25   0.983
> 5 2010-02-26 2010-03-11   0.898
> 6 2010-03-12 2010-03-25   0.864
> 7 2010-03-26 2010-04-08   0.920
> 8 2010-04-09 2010-04-22   0.925
> 9 2010-04-23 2010-05-06   0.998
> 10 2010-05-07 2010-05-20   0.761
> # ... with 69 more rows
> 
> On Wed, Jun 19, 2019 at 5:08 PM Eric Berger <ericjberger using gmail.com> wrote:
> 
>> Hi Bert,
>> I agree that our interpretation is slightly different, which is why I
>> wrote:
>> "If one can take the actual week of the year as an acceptable definition
>> of week, then here's my approach."
>> 
>> Regards,
>> Eric
>> 
>> On Wed, Jun 19, 2019 at 5:04 PM Bert Gunter <bgunter.4567 using gmail.com>
>> wrote:
>> 
>>> Eric:
>>> 
>>> I believe you're doing something different than I did. I broke up each
>>> month into biweekly periods, 2+ per month. You seem to be grouping the
>>> overall entire period into biweekly intervals -- apologies if I'm wrong,
>>> but if I understood correctly, that's not the same thing. I do not know
>>> which of us -- if either -- has interpreted her query correctly.
>>> 
>>> Cheers,
>>> Bert
>>> 
>>> On Wed, Jun 19, 2019 at 2:35 AM Eric Berger <ericjberger using gmail.com>
>>> wrote:
>>> 
>>>> Hi Siti,
>>>> I didn't test Bert's code but I assume it's fine. :-)
>>>> I would take a different approach than Bert. I prefer to use a package
>>>> such as lubridate to handle the date wrangling, and a package such as dplyr
>>>> to handle the grouping and max extraction.
>>>> It may be overkill for this problem, but these are great packages to
>>>> become familiar with.
>>>> If one can take the actual week of the year as an acceptable definition
>>>> of week, then here's my approach.
>>>> 
>>>> library(lubridate)
>>>> library(dplyr)
>>>> 
>>>> # Step 1: start with Bert's code to create sample data
>>>> ## create some example data for 3 months in 2000
>>>> d<- 2e7 +c(113:131,201:228, 301:330) ## dates
>>>> conc <- runif(length(d)) # concentrations
>>>> 
>>>> # Step 2: collect the data into a data frame
>>>> df <- data.frame( dt=d, conc=conc)
>>>> 
>>>> # Step 3: use lubridate's ymd() function to parse the dates, its week()
>>>> function to identify the week of the year, and define the new column
>>>> 'wkpair' that groups the weeks 2-at-a-time
>>>> df2 <- dplyr::mutate( df,
>>>> wkpair=as.integer(floor(lubridate::week(lubridate::ymd(dt) )/2)) )
>>>> 
>>>> # Step 4: group by the wkpair and use dplyr's summarise to get the info
>>>> you wanted
>>>> df3 <- dplyr::group_by(df2,wkpair) %>%
>>>>          dplyr::summarise( from=min(dt), to=max(dt), maxconc=max(conc))
>>>> %>%
>>>>          dplyr::select(from,to,maxconc)
>>>> 
>>>> df3
>>>> 
>>>> # A tibble: 6 x 3
>>>>      from       to maxconc
>>>>     <dbl>    <dbl>   <dbl>
>>>> 1 20000113 20000121   0.963
>>>> 2 20000122 20000204   0.988
>>>> 3 20000205 20000218   0.939
>>>> 4 20000219 20000303   0.883
>>>> 5 20000304 20000317   0.863
>>>> 6 20000318 20000330   0.765
>>>> 
>>>> HTH,
>>>> Eric
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jun 18, 2019 at 9:39 PM Bert Gunter <bgunter.4567 using gmail.com>
>>>> wrote:
>>>> 
>>>>> My apologies. I negected to cc r-help. -- Bert
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Jun 18, 2019 at 11:21 AM Bert Gunter <bgunter.4567 using gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> I assume that 20000215 means year 2000, month 2, day 15.
>>>>>> I also assume that you want maxes for the first 2 weeks of a month,
>>>>> the
>>>>>> second 2 weeks, and any remaining days.
>>>>>> I also assume that this might be desired for arbitrary numbers of
>>>>> years,
>>>>>> months, and days.
>>>>>> 
>>>>>> The following is one way to do this. As it's a slight pain to cut and
>>>>>> paste email data as text into R (use ?dput or R code to run to provide
>>>>>> example data instead), I just made up my own. You'll have to do the
>>>>>> following within a data frame through extraction or by using with() of
>>>>>> course.
>>>>>> 
>>>>>> ## create some example data for 3 months in 2000
>>>>>> d<- 2e7 +c(113:131,201:228, 301:330) ## dates
>>>>>> conc <- runif(length(d)) # concentrations
>>>>>> 
>>>>>> ## convert the date to character to extract year, month, and day
>>>>>> cdate <- as.character(d)
>>>>>> ## use substr to to the extraction
>>>>>> year <- substr(cdate,1,4)
>>>>>> mon <- substr(cdate,5,6)
>>>>>> day <- substr(cdate, 7,8)
>>>>>> 
>>>>>> ## convert day to numeric and use cut() to group into the biweekly
>>>>> periods.
>>>>>> d14 <- cut(as.numeric(day), c(0,14.5,28.5, 32))
>>>>>> 
>>>>>> ## Use tapply() to create your desired table of results.
>>>>>> tapply(conc, list(year, d14, mon), max, na.rm = TRUE)
>>>>>> 
>>>>>> ## Results
>>>>>> 
>>>>>> , , 01
>>>>>> 
>>>>>>      (0,14.5] (14.5,28.5] (28.5,32]
>>>>>> 2000 0.7357389   0.9655391 0.7962965
>>>>>> 
>>>>>> , , 02
>>>>>> 
>>>>>>      (0,14.5] (14.5,28.5] (28.5,32]
>>>>>> 2000 0.8193979   0.9487207        NA
>>>>>> 
>>>>>> , , 03
>>>>>> 
>>>>>>      (0,14.5] (14.5,28.5] (28.5,32]
>>>>>> 2000 0.9718919   0.9997093  0.168659
>>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> Bert
>>>>>> 
>>>>>> Bert Gunter
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jun 18, 2019 at 8:53 AM SITI AISYAH BINTI ZAKARIA <
>>>>>> aisyahzakaria using unimap.edu.my> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'm Aisyah..I have a problem to run my R coding. I want to select
>>>>> maximum
>>>>>>> value according to week.
>>>>>>> 
>>>>>>> here is my data
>>>>>>> 
>>>>>>> Date          O3_Conc
>>>>>>> 20000101        0.033
>>>>>>> 20000102        0.023
>>>>>>> 20000103        0.025
>>>>>>> 20000104        0.041
>>>>>>> 20000105        0.063
>>>>>>> 20000106        0.028
>>>>>>> 20000107        0.068
>>>>>>> 20000108        0.048
>>>>>>> 20000109        0.037
>>>>>>> 20000110        0.042
>>>>>>> 20000111        0.027
>>>>>>> 20000112        0.035
>>>>>>> 20000113        0.063
>>>>>>> 20000114        0.035
>>>>>>> 20000115        0.042
>>>>>>> 20000116        0.028
>>>>>>> 
>>>>>>> I want to find the max value from column O3_Conc for only 14 days
>>>>> that
>>>>>>> refer to biweekly in month. And the next 14 days for the max value.
>>>>>>> 
>>>>>>> I hope that I can get the result like this:
>>>>>>> 
>>>>>>> Date                     Max O3_Conc
>>>>>>> 20000101 - 20000114        0.068
>>>>>>> 20000115 - 20000129        0.061
>>>>>>> 
>>>>>>> I try many coding but still unavailable.
>>>>>>> 
>>>>>>> this example my coding
>>>>>>> 
>>>>>>> library(plyr)
>>>>>>>      data.frame(CA0003)
>>>>>>> 
>>>>>>>      # format weeks as per requirement (replace "00" with "52" and
>>>>>>> adjust corresponding year)
>>>>>>>      tmp <- list()
>>>>>>>      tmp$y <- format(df$Date, format="%Y")
>>>>>>>      tmp$w <- format(df$Date, format="%U")
>>>>>>>      tmp$y[tmp$w=="00"] <-
>>>>> as.character(as.numeric(tmp$y[tmp$w=="00"]) -
>>>>>>> 14)
>>>>>>>      tmp$w[tmp$w=="00"] <- "884"
>>>>>>>      df$week <- paste(tmp$y, tmp$w, sep = "-")
>>>>>>> 
>>>>>>>      # get summary
>>>>>>>      df2 <- ddply(df, .(week),transform, O3_Conc=max(O3_Conc))
>>>>>>> 
>>>>>>>      # include week ending date
>>>>>>>      tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week
>>>>> ==x,
>>>>>>> "O3_Conc"])[[1]])
>>>>>>>      df2$week.ending <- sapply(tmp$week.ending, max(O3_Conc, TRUE)
>>>>>>> 
>>>>>>> output
>>>>>>>        Site_Id Site_Location
>>>>> Date
>>>>>>>      Year    O3_Conc Month   Day     week
>>>>>>> 1       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000101
>>>>>>>      2000    0.033   1       1       NULL-NULL
>>>>>>> 2       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000102
>>>>>>>      2000    0.023   1       2       NULL-NULL
>>>>>>> 3       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000103
>>>>>>>      2000    0.025   1       3       NULL-NULL
>>>>>>> 4       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000104
>>>>>>>      2000    0.041   1       4       NULL-NULL
>>>>>>> 5       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000105
>>>>>>>      2000    0.063   1       5       NULL-NULL
>>>>>>> 6       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000106
>>>>>>>      2000    0.028   1       6       NULL-NULL
>>>>>>> 7       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000107
>>>>>>>      2000    0.068   1       7       NULL-NULL
>>>>>>> 8       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000108
>>>>>>>      2000    0.048   1       8       NULL-NULL
>>>>>>> 9       CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000109
>>>>>>>      2000    0.037   1       9       NULL-NULL
>>>>>>> 10      CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000110
>>>>>>>      2000    0.042   1       10      NULL-NULL
>>>>>>> 11      CA0003  Sek. Keb. Cederawasih, Taman Inderawasih, Perai
>>>>> 20000111
>>>>>>>      2000    0.027   1       11      NULL-NULL
>>>>>>> 



More information about the R-help mailing list