[R] run a calculation function over time fields, ordered and grouped by variables

jdnewmil jdnewmil at dcn.davis.ca.us
Tue May 26 02:04:17 CEST 2015


Another way:

#create four columns with route id, stop sequence interval time and 
route start time
ssq <- c( 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 2, 3, 4, 5, 6, 7, 8 )
tint <- c( "00:00", "00:12", "00:03", "00:06", "00:09", "00:02", "00:04"
          , "00:00", "00:08", "00:10", "00:10"
          , "00:00", "00:02", "00:04", "00:08", "00:02", "00:01", "00:01" 
)
tst <- c( rep( "18:20", 7 )
         , rep( "10:50", 4 )
         , rep( "16:15", 7 ) )
rtid <- c( rep( "a", 7 )
          , rep( "b", 4 )
          , rep( "c", 7 ) )
# Don't use cbind to make data frames... it usually ends up
# forcing all columns to be character or factors
# Also, avoid using "df" as a variable name... it is the name of
# a function in base R, so that gets confusing fast
DF <- data.frame( ssq, tint, tst, rtid, stringsAsFactors=FALSE )
DF

#correct data set should look like this
tarr <- c( "18:20", "18:32", "18:35", "18:41", "18:50", "18:52", "18:56"
          , "10:50", "10:58", "11:08", "11:18"
          , "16:15", "16:17", "16:21", "16:29", "16:31", "16:32", "16:33" 
)
DF2  <- data.frame( DF, tarr, stringsAsFactors=FALSE )
DF2

library(dplyr)
DFs <- (   DF
        %>% group_by( rtid )
        %>% mutate( tarr
             = as.character(   as.POSIXct( tst, format="%H:%M", tz="GMT" 
)
                             + as.difftime(
                                 cumsum(
                                   as.numeric(
                                       as.difftime( tint, format="%H:%M" 
)
                                     , units="mins"
                                     )
                                   )
                                 , units="mins"
                                 )
                           , format="%H:%M" )
                  )
        %>% as.data.frame # removes grouping behavior from result
        )
identical( DFs, DF2 )

On 2015-05-25 15:43, Jim Lemon wrote:
> Hi gavinr,
> Perhaps this will do what you want.
> 
> add_HH_MM<-function(x) {
>  t1bits<-strsplit(as.character(x$tst),":")
>  t2bits<-strsplit(as.character(x$tint),":")
> 
> hours<-as.numeric(lapply(t1bits,"[",1))+cumsum(as.numeric(lapply(t2bits,"[",1)))
> 
> minutes<-as.numeric(lapply(t1bits,"[",2))+cumsum(as.numeric(lapply(t2bits,"[",2)))
>  next_hour<-minutes > 59
>  # adjust for running into the next hour
>  minutes[next_hour]<-minutes[next_hour]-60
>  hours[next_hour]<-hours[next_hour]+1
>  # adjust for running into the next day
>  hours[hours > 23]<-hours[hours > 23]-24
> 
> return(paste(formatC(hours,width=2,flag=0),formatC(minutes,width=2,flag=0),sep=":"))
> }
> 
> df$tarr<-unlist(by(df,df$rtid,add_HH_MM))
> 
> Jim
> 
> 
> On Tue, May 26, 2015 at 5:28 AM, gavinr <g.rudge at bham.ac.uk> wrote:
>> I’ve got some transit data relating to bus stops for a GIS data set.  
>> Each
>> row represents one stop on a route.  For each record I have the start 
>> time
>> of the route, a sequence in which a bus stops, the time the bus 
>> arrives at
>> the first stop and the time taken to get to each of the stops from the 
>> last
>> one in the sequence.  Not all sequences of stops starts with the 
>> number 1,
>> some may start with a higher number.
>> I need to make a new variable which has the time the bus arrives at 
>> each
>> stop by using the start time from the stop with the lowest sequence 
>> number,
>> to populate all of the arrival times for each stop in each route.
>> 
>> I have a very simple example below with just three routes and a few 
>> stops in
>> each.  My actual data set has a few million rows.  I've also created a
>> version of the data set I'm aiming to get.
>> 
>> There are two problems here.  Firstly getting the data into the 
>> correct
>> format to do the calculations with
>> durations, and secondly running a function over the data set to obtain 
>> the
>> times.
>> It is the durations that are critical not the date, so using the POSIX
>> methods doesn’t really seem appropriate here.  Ultimately the times 
>> are
>> going to be used in a route solver in an ArcSDE geodatabase.  I tried 
>> to use
>> strptime to format my times, but could not get them into a data.frame 
>> as
>> presumably they are a list.  In this example I’ve left them as 
>> strings.
>> 
>> Any help is much appreciated.
>> 
>> #create four columns with route id, stop sequence interval time and 
>> route
>> start time
>> ssq<-c(3,4,5,6,7,8,9,1,2,3,4,2,3,4,5,6,7,8)
>> tint<-c("00:00","00:12","00:03","00:06","00:09","00:02","00:04","00:00","00:08","00:10","00:10","00:00","00:02","00:04","00:08","00:02","00:01","00:01")
>> tst<-c(rep("18:20",7),rep("10:50",4),rep("16:15",7))
>> rtid<-c(rep("a",7),rep("b",4),rep("c",7))
>> df<-data.frame(cbind(ssq,tint,tst,rtid))
>> df
>> 
>> #correct data set should look like this
>> tarr<-c("18:20","18:32","18:35","18:41","18:50","18:52","18:56","10:50","10:58","11:08","11:18","16:15","16:17","16:21","16:29","16:31","16:32","16:33")
>> df2<-cbind(df,tarr)
>> df2
>> 
>> 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://r.789695.n4.nabble.com/run-a-calculation-function-over-time-fields-ordered-and-grouped-by-variables-tp4707655.html
>> Sent from the R help mailing list archive at Nabble.com.
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list