[R] Corrected - R 3.0.2 How to Split-Apply-Combine using various Columns

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Sun Jan 26 05:24:10 CET 2014


But still wrong... needs to be calculated in two steps because the cars are seen in multiple frames.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

jim holtman <jholtman at gmail.com> wrote:
>FYI, increased it to 3M rows of data and it took 0.3 seconds to
>summarize it:
>
>> system.time({
>+ # summarize the slot & Class
>+ result <- DF[
>+     , list(Total = .N  # count of entries
>+         , mVel = mean(velocity)  # average velocity
>+         )
>+     , keyby = 'slot,Class'
>+     ]
>+ })
>   user  system elapsed
>   0.30    0.00    0.29
>> dim(DF)
>[1] 3000000       5
>> head(result, 20)
>                   slot Class Total     mVel
> 1:            [22,322]     1 32449 37.70693
> 2:            [22,322]     2 32741 37.49304
> 3:            [22,322]     3 32213 37.40232
> 4:           (322,622]     1 32257 37.38503
> 5:           (322,622]     2 31912 37.35941
> 6:           (322,622]     3 32360 37.43757
> 7:           (622,922]     1 32359 37.48244
> 8:           (622,922]     2 32505 37.57031
> 9:           (622,922]     3 31966 37.58736
>10:      (922,1.22e+03]     1 32228 37.34592
>11:      (922,1.22e+03]     2 32099 37.44663
>12:      (922,1.22e+03]     3 32054 37.51975
>13: (1.22e+03,1.52e+03]     1 32230 37.50131
>14: (1.22e+03,1.52e+03]     2 32353 37.50793
>15: (1.22e+03,1.52e+03]     3 32280 37.46094
>16: (1.52e+03,1.82e+03]     1 31957 37.51671
>17: (1.52e+03,1.82e+03]     2 32030 37.48419
>18: (1.52e+03,1.82e+03]     3 32365 37.49100
>19: (1.82e+03,2.12e+03]     1 32244 37.59558
>20: (1.82e+03,2.12e+03]     2 32309 37.49966
>>
>
>Jim Holtman
>Data Munger Guru
>
>What is the problem that you are trying to solve?
>Tell me what you want to do, not how you want to do it.
>
>
>On Sat, Jan 25, 2014 at 9:52 PM, jim holtman <jholtman at gmail.com>
>wrote:
>> Here is another way of doing it using the data.table package.  Tried
>> using new new 'dplyr' package, but it had problems.
>>
>>> # create some test data
>>> N <- 30000  # rows of data
>>> set.seed(1)
>>> DF <- data.frame(vehicle = sample(1:5, N, TRUE)
>> +             , frame = sample(22:9322, N, TRUE)
>> +             , velocity = runif(N, 15, 60)
>> +             , Class = sample(1:3, N, TRUE)
>> +             )
>>> # break into slot (as previous post)
>>> DF$slot <- cut(DF$frame, seq(22, 9322, 300), include.lowest = TRUE)
>>> str(DF)
>> 'data.frame':   30000 obs. of  5 variables:
>>  $ vehicle : int  2 2 3 5 2 5 5 4 4 1 ...
>>  $ frame   : int  3884 1333 4277 7490 3008 8096 2318 7662 2017 5324
>...
>>  $ velocity: num  47.9 44.1 51.6 19.7 30 ...
>>  $ Class   : int  3 1 3 3 1 1 3 3 3 3 ...
>>  $ slot    : Factor w/ 31 levels "[22,322]","(322,622]",..: 13 5 15
>25
>> 10 27 8 26 7 18 ...
>>>
>>> # use the data.table package
>>> require(data.table)
>>> DF <- data.table(DF)
>>>
>>> # summarize the slot & Class
>>> result <- DF[
>> +     , list(Total = .N  # count of entries
>> +         , mVel = mean(velocity)  # average velocity
>> +         )
>> +     , keyby = 'slot,Class'
>> +     ]
>>>
>>> head(result, 20)
>>                    slot Class Total     mVel
>>  1:            [22,322]     1   317 38.68162
>>  2:            [22,322]     2   291 38.20892
>>  3:            [22,322]     3   356 38.18092
>>  4:           (322,622]     1   313 37.46468
>>  5:           (322,622]     2   326 38.35842
>>  6:           (322,622]     3   329 37.46077
>>  7:           (622,922]     1   327 37.06243
>>  8:           (622,922]     2   321 37.01083
>>  9:           (622,922]     3   351 37.12343
>> 10:      (922,1.22e+03]     1   361 36.76076
>> 11:      (922,1.22e+03]     2   343 38.48242
>> 12:      (922,1.22e+03]     3   312 37.64318
>> 13: (1.22e+03,1.52e+03]     1   304 36.89459
>> 14: (1.22e+03,1.52e+03]     2   324 36.60410
>> 15: (1.22e+03,1.52e+03]     3   343 36.11219
>> 16: (1.52e+03,1.82e+03]     1   313 38.71039
>> 17: (1.52e+03,1.82e+03]     2   305 38.29188
>> 18: (1.52e+03,1.82e+03]     3   308 38.61945
>> 19: (1.82e+03,2.12e+03]     1   308 37.69332
>> 20: (1.82e+03,2.12e+03]     2   335 37.38025
>>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>>
>> On Sat, Jan 25, 2014 at 4:38 PM, Jeff Newmiller
>> <jdnewmil at dcn.davis.ca.us> wrote:
>>> Sorry, messed up the second ddply example:
>>>
>>> dta3 <- ddply( dta2, c("slot","classf"), function(DF){data.frame(
>Total=nrow(DF),
>>> MeanVelocity= mean( DF$TimeMeanVelocity ) ) } )
>>>
>>>
>---------------------------------------------------------------------------
>>> Jeff Newmiller                        The     .....       .....  Go
>Live...
>>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#. 
>Live Go...
>>>                                       Live:   OO#.. Dead: OO#.. 
>Playing
>>> Research Engineer (Solar/Batteries            O.O#.       #.O#. 
>with
>>> /Software/Embedded Controllers)               .OO#.       .OO#. 
>rocks...1k
>>>
>---------------------------------------------------------------------------
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>>>While you seem to be making some progress in communicating your
>>>>problem, the format is still HTML (so it is a mess) and the subject
>and
>>>>approach of the question are still a poor fit for this list. We are
>not
>>>>here to DO your work for you, yet you seem to have an overly long
>list
>>>>of "needs" that suggest you want a complete solution. What you
>should
>>>>be looking for here are suggestions for how to solve pieces of this
>>>>task so that you can do the work of creating your own solution.
>>>>Some tools that I find useful for this kind of problem are the cut
>>>>function, the plyr package, and the reshape2 package. Others might
>find
>>>>the aggregate function or the sqldf package or the datatable package
>or
>>>>the new dplyr package helpful. Each function and package has
>>>>documentation with examples that you should read before using them
>>>>(e.g. ?cut).
>>>>
>>>>Some example calculations are (with dta as your sample data frame):
>>>>
>>>>library(plyr)
>>>>dta$slot <- cut( dta$frame, seq(22,9322,300))
>>>>dta$classf <- factor(dta$class, levels=1:3,
>>>>labels=c("motorcycle","car","truck"))
>>>>dta2 <- ddply( dta, c("slot","classf","vehicle"),
>>>>function(DF){data.frame( TimeMeanVelocity=mean(DF$velocity) ) } )
>>>>dta3 <- ddply( dta2, c("slot","classf"), function(DF){data.frame(
>>>>MeanVelocity=mean( Total=nrow(DF), DF$TimeMeanVelocity ) ) } )
>>>>
>>>>Then you need to fold the total and mean velocity into wide form
>using
>>>>the dcast function from the reshape2 package (read the
>documentation)
>>>>and merge them with the merge or cbind functions.
>>>>
>>>>Good luck, and keep working on making your questions clear.
>>>>---------------------------------------------------------------------------
>>>>Jeff Newmiller                        The     .....       .....  Go
>>>>Live...
>>>>DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#. 
>Live
>>>>Go...
>>>>                                     Live:   OO#.. Dead: OO#.. 
>Playing
>>>>Research Engineer (Solar/Batteries            O.O#.       #.O#. 
>with
>>>>/Software/Embedded Controllers)               .OO#.       .OO#.
>>>>rocks...1k
>>>>---------------------------------------------------------------------------
>>>>
>>>>Sent from my phone. Please excuse my brevity.
>>>>
>>>>umair durrani <umairdurrani at outlook.com> wrote:
>>>>>Hello everyone,Here is the version using dput. I am sorry for the
>junk
>>>>>I posted before. I have a large vehicle trajectory data of which
>>>>>following is a small part:
>>>>>structure(list(vehicle = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,2L,
>2L),
>>>>>frame = c(221L, 222L, 223L, 224L, 115L, 116L, 117L, 118L, 119L,
>120L,
>>>>>121L), globalx = c(6451259.685, 6451261.244, 6451262.831,
>6451264.362,
>>>>>6451181.179, 6451183.532, 6451185.884, 6451188.237, 6451190.609,
>>>>>6451192.912, 6451195.132), class = c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
>2L,
>>>>>2L, 2L, 2L), velocity = c(23.37, 23.16, 22.94, 22.85, 35, 35.01,
>>>>35.03,
>>>>>34.92, 34.49, 33.66, 32.5), lane = c(5L, 5L, 5L, 5L, 4L, 4L, 4L,
>4L,
>>>>>4L, 4L, 4L)), .Names = c("vehicle", "frame", "globalx", "class",
>>>>>"velocity", "lane"), row.names = c(85L, 86L, 87L, 88L, 447L, 448L,
>>>>>449L, 450L, 451L, 452L, 453L), class = "data.frame")
>>>>>Explanation of Columns:vehicle = unique ID of vehicle. It is
>repeated
>>>>>(in column) for every frame in which it was observed;frame= ID of
>the
>>>>>frame in which the vehicle was observed. One frame is 0.1 seconds
>>>>>long;class = class of vehicle i.e. 1=motorcycle, 2=car,
>>>>>3=truck;velocity= velocity of vehicle in feet per second;lane= lane
>>>>>number in which vehicle is present in a particular frame;
>>>>>
>>>>>'frame' number can also repeat e.g. in frame 120 the example data
>>>>shows
>>>>>vehicle 2 was observed but in the original data many more vehicles
>>>>>might have been observed in this frame. Similarly, 'class' is
>defined
>>>>>above and all three classes are present in the original data (here
>>>>>example data only shows classes 2 and 3 i.e. cars and trucks).
>>>>>I need to determine two things:1) Number of vehicles observed in
>every
>>>>>30 seconds i.e. 300 frames 2) Average velocity of each vehicle
>class
>>>>in
>>>>>every 30 seconds
>>>>>> This means that the first step might be to determine the minimum
>and
>>>>>maximum frame numbers and then divide them in slots so that every
>slot
>>>>>has 300 frames. In my original data I found 22 as min and 9233 as
>max
>>>>>frame number. This makes 30 time slots as 22-322, 322-622, ...,
>>>>>9022-9233. I need following columns in one table as an output (note
>>>>>that Timeslot column should contain the time intervals as described
>>>>>before): TimeSlot, Total-Cars, Total-Trucks, Total-Motorcycles,
>>>>>MeanVelocity-Cars, MeanVelocity-Trucks, MeanVelocity-Motorcycles
>>>>>
>>>>>
>>>>>
>>>>>      [[alternative HTML version deleted]]
>>>>>
>>>>>______________________________________________
>>>>>R-help at r-project.org mailing list
>>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>PLEASE do read the posting guide
>>>>>http://www.R-project.org/posting-guide.html
>>>>>and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>______________________________________________
>>>>R-help at r-project.org mailing list
>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>PLEASE do read the posting guide
>>>>http://www.R-project.org/posting-guide.html
>>>>and provide commented, minimal, self-contained, reproducible code.
>>>




More information about the R-help mailing list