[R] Processing key_column, begin_date, end_date in R

jim holtman jholtman at gmail.com
Thu Feb 26 16:23:57 CET 2015


here is yet another way:

> dta <- read.csv( text=
+ "key_column,begin_date,end_date
+  123456,2013-01-01,2014-01-01
+  123456,2013-07-01,2014-07-01
+  789102,2012-03-01,2014-03-01
+  789102,2015-02-01,2016-02-01
+  789102,2015-02-06,2016-02-06
+ 789102,2015-02-28,2015-03-31
+  789102,2015-04-30,2015-05-31"
+  , as.is=TRUE)
>
> # check for overlap by sorting into time order and the adding 1 for
> # begin and -1 for end and create cumsum
> # select only resulting entries with begin @ 1 and end @ 0
> dta <- dta %>%
+     mutate(begin_date = as.Date(begin_date)  # convert the times
+         , end_date = as.Date(end_date)
+         ) %>%
+     gather(time, value, -key_column) %>%  # create 'long' data
+     mutate(oper = ifelse(grepl('^b', time), 1, -1)) %>%  # value for begin/end
+     arrange(value) %>%  # sort by time
+     group_by(key_column) %>%  # separate into groups
+     mutate(depth = cumsum(oper)) %>%
+     filter((grepl("^b", time) & depth == 1) |  # filter on begin at 1 and end at 0
+           (grepl("^e", time) & depth == 0)
+           ) %>%
+     ungroup() %>%
+     arrange(key_column, value)
> # now have pairs of lines for the times
> indx <- seq(1, nrow(dta), 2)
> result <- data.frame(key_column = dta$key_column[indx]
+                 , begin_time = dta$value[indx]
+                 , end_time = dta$value[indx + 1L]
+                 , stringsAsFactors = FALSE
+                 )
> result
  key_column begin_time   end_time
1     123456 2013-01-01 2014-07-01
2     789102 2012-03-01 2014-03-01
3     789102 2015-02-01 2016-02-06
>

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Wed, Feb 25, 2015 at 4:18 PM, Matt Gross <grossm at gmail.com> wrote:
> Hi,
>
> I am trying to process a large dataset in R.  The dataset contains the
> following three columns:
>
> key_column - a unique key identifier
> begin_date - the start date of the active period
> end_date - the end date of the active period
>
>
> Example data is here:
>
> key_column,begin_date,end_date
> 123456,2013-01-01,2014-01-01
> 123456,2013-07-01,2014-07-01
> 789102,2012-03-01,2014-03-01
> 789102,2015-02-01,2016-02-01
> 789102,2015-02-06,2016-02-06
>
> I want to build a condensed table of key_column and begin_date's and
> end_date's.  As you can see in the example data above, some begin and end
> date periods overlap with begin_date and end_date pairs for the same
> key_column.  In situations where overlap exists I want to have one record
> for the key_column with the min(begin_date) and the max(end_date).
>
> Can anyone help me build the commands to process this data in R?
>
> Thanks,
> Matt
>
> --
> Matt Gross
> grossm at gmail.com
> 503.329.4545
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list