[R] Bus stop sequence matching problem

Gabor Grothendieck ggrothendieck at gmail.com
Sat Aug 30 15:08:46 CEST 2014


Try dtw.  First convert ref to numeric since dtw does not handle
character input.  Then align using dtw and NA out repeated values in
the alignment.  Finally zap ugly row names and calculate loading:

library(dtw)
s1 <- as.numeric(stop_sequence$ref)
s2 <- as.numeric(factor(as.character(stop_onoff$ref),
levels(stop_sequence$ref)))
a <- dtw(s1, s2)
DF <- cbind(stop_sequence,
      stop_onoff[replace(a$index2, c(FALSE, diff(a$index2) == 0), NA), ])[-3]
rownames(DF) <- NULL
transform(DF, loading = cumsum(ifelse(is.na(on), 0, on)) -
                        cumsum(ifelse(is.na(off), 0, off)))

giving:

  seq ref on off loading
1  10   A  5   0       5
2  20   B NA  NA       5
3  30   C NA  NA       5
4  40   D  0   2       3
5  50   B 10   2      11
6  60   A  0   6       5

You will need to test this with more data and tweak it if necessary
via the various dtw arguments.


On Fri, Aug 29, 2014 at 8:46 PM, Adam Lawrence <alaw005 at gmail.com> wrote:
> I am hoping someone can help me with a bus stop sequencing problem in R,
> where I need to match counts of people getting on and off a bus to the
> correct stop in the bus route stop sequence. I have tried looking
> online/forums for sequence matching but seems to refer to numeric sequences
> or DNA matching and over my head. I am after a simple example if anyone can
> please help.
>
> I have two data series as per below (from database), that I want to
> combine. In this example “stop_sequence” includes the equence (seq) of bus
> stops and “stop_onoff” is a count of people getting on and off at certain
> stops (there is no entry if noone gets on or off).
>
> stop_sequence <- data.frame(seq=c(10,20,30,40,50,60),
> ref=c('A','B','C','D','B','A'))
> ##   seq ref
> ## 1  10   A
> ## 2  20   B
> ## 3  30   C
> ## 4  40   D
> ## 5  50   B
> ## 6  60   A
> stop_onoff <-
> data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6))
> ##   ref on off
> ## 1   A  5   0
> ## 2   D  0   2
> ## 3   B 10   2
> ## 4   A  0   6
>
> I need to match the stop_onoff numbers in the right sto sequence, with the
> correctly matched output as follows (load is a cumulative count of on and
> off)
>
> desired_output <- data.frame(seq=c(10,20,30,40,50,60),
> ref=c('A','B','C','D','B','A'),
> on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5))
> ##   seq ref on off load
> ## 1  10   A  5   0    5
> ## 2  20   B  -   -    0
> ## 3  30   C  -   -    0
> ## 4  40   D  0   2    3
> ## 5  50   B 10   2   11
> ## 6  60   A  0   6    5
>
> In this example the stop “B” is matched to the second stop “B” in the stop
> sequence and not the first because the onoff data is after stop “D”.
>
> Any guidance much appreciated.
>
> Regards
> Adam
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list