[R] templated use of aggregate

Wed Jun 13 15:53:00 CEST 2012

On Jun 13, 2012, at 9:38 AM, Matthew Johnson wrote:

> Sorry, i'll try and put more flesh on the bones.
>
> please note, i changed the data in the example, as fiddling has raised
> another question that's best illustrated with a slightly different
> data set.
>
> first of all, when i do as you suggest, i obtain the following error:
>
>> PxMat <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)
>
> Error in aggregate.formula(mm[, -1] ~ mm[, 1], data = mm, sum) :
>   'names' attribute [3] must be the same length as the vector [1]

Very strange. When I just did it with the structure you (cross-)  
posted on SO I got:

 > adf <- aggregate(mm[,-1]~mm[,1], data=mm, sum); adf
snipped warning messages
   mm[, 1] mm[, -1]
1   97.91      538
2   97.92      918

I had earlier tested it with a zoo object I had constructed and did it  
again with the structure below.

   mm[, 1] mm[, -1]
1   97.91      538
2   97.92      918

I'm using zoo_1.7-6 and R version 2.14.2 on a Mac. I do not remember  
you posting the requested information about your versions.

-- 
David.

>
> my data.frame is an xts, and it looks like this:
>
>                     px_ym1 vol_ym1
> 2012-06-01 09:30:00  97.90       9
> 2012-06-01 09:30:00  97.90      60
> 2012-06-01 09:30:00  97.90      71
> 2012-06-01 09:30:00  97.90       5
> 2012-06-01 09:30:00  97.90       3
> 2012-06-01 09:30:00  97.90      21
> 2012-06-01 09:31:00  97.90       5
> 2012-06-01 09:31:00  97.89     192
> 2012-06-01 09:31:00  97.89      65
> 2012-06-01 09:31:00  97.89      73
> 2012-06-01 09:31:00  97.89       1
> 2012-06-01 09:31:00  97.89       1
> 2012-06-01 09:31:00  97.89      39
> 2012-06-01 09:31:00  97.90      15
> 2012-06-01 09:31:00  97.90       1
> 2012-06-01 09:31:00  97.89       1
> 2012-06-01 09:31:00  97.90      18
> 2012-06-01 09:31:00  97.89       1
> 2012-06-01 09:32:00  97.89      33
> 2012-06-01 09:34:00  97.89       1
> 2012-06-01 09:34:00  97.89       1
>
> dput(mn) returns:
>
>> dput(mn)
> structure(c(97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.89,
> 97.89, 97.89, 97.89, 97.89, 97.89, 97.9, 97.9, 97.89, 97.9, 97.89,
> 97.89, 97.89, 97.89, 9, 60, 71, 5, 3, 21, 5, 192, 65, 73, 1,
> 1, 39, 15, 1, 1, 18, 1, 33, 1, 1), .indexCLASS = c("POSIXct",
> "POSIXt"), .indexTZ = "GMT", class = c("xts", "zoo"), index =
> structure(c(1338543000,
> 1338543000, 1338543000, 1338543000, 1338543000, 1338543000,  
> 1338543060,
> 1338543060, 1338543060, 1338543060, 1338543060, 1338543060,  
> 1338543060,
> 1338543060, 1338543060, 1338543060, 1338543060, 1338543060,  
> 1338543120,
> 1338543240, 1338543240), tzone = "GMT", tclass = c("POSIXct",
> "POSIXt")), .Dim = c(21L, 2L), .Dimnames = list(NULL, c("px_ym1",
> "vol_ym1")))
>
> as you can see, the xts data.frame xts data.frame that contains dates,
> prices and volumes. There is much more data over a long time period,
> and i'm interested in various sub-setting and then aggregate
> operations.
>
> I would like to split the data by time period and aggregate the data,
> such that i obtain a table which reports the volume traded at each
> price, for each of the time-period splits that i have chosen.
>
> I have employed the following approach:
>
> PxMat <- aggregate(.~px_ym1, data=mn, sum)
>
>
> which yields:
>
>   px_ym1 vol_ym1
> 1  97.89     408
> 2  97.90     208
>
> and for subsets, i use the following grouping:
>
>> PxMat30 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '30'], sum)
>
> Which yields:
>
>   px_ym1 vol_ym1
> 1   97.9     169
>
> and
>
>> PxMat31 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '31'], sum)
>
> which yields:
>
>  px_ym1 vol_ym1
> 1  97.89     373
> 2  97.90      39
>
> and so on and so forth for each minute.
>
> when i try and sub-set using general notation, as follows:
>
> PxMat <- aggregate(.~mn[,1], data=mn, sum)
>
> this yields a different form of output:
>
> px_ym1  px_ym1 vol_ym1
> 1  97.90 1076.79     408
> 2  97.89  979.00     208
>
> the problem is that i now have the sum of the px_ym1 data (the sum  
> of mn[,1])
>
> hopefully things are now clearer - sorry to have wasted your time up
> until now.
>
> assuming that i have now made my situation clear, i am hope you can
> help with four specific questions.
>
> 1/ My data-sets are HUGE, so speed is an issue - is this the fastest
> way to sub-set and aggregate an xts?
>
> 2/ is there a way to do this for multiple splits? say a table for each
> minute, day, week, or month? the return would potentially be a list
> with a table for each day / minute etc showing volume traded at each
> price -- but it doesn't have to be a list ...
>
> i am writing a function with loops that would generate a table that
> reports volume traded at each price for each case of a specified time
> split (say for four tables, one for each minute in the example data,
> returned as a list). my solution is slow, it seems like something that
> someone would have done better already. is this the case?
>
> 3/ is there a way to do the sub-setting with templated variables? i
> would like to obtain the table i get with the named aggregate
> functions (reproduced above) with multiple data frames, as the column
> names will differ from time to time. i cannot figure out how to stop
> the command from summing the mn[,1] column when i stop using variable
> names.
>
> 4/ on a related note, is it possible to apply different functions to
> different columns of data? It would be nice, for example, if the table
> returned from an aggregate command could be made to be:
>
> px_ym1  count vol_ym1
> 1  97.90  11     408
> 2  97.89  10     208
>
> where we have the price traded, the number of trades (a count of
> px_ym1 / mn[,1], and the sum of vol_ym1 (mn[,2]).
>
> thanks and best regards
>
> matt johnson
>
> On 13 June 2012 15:06, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>>
>> On Jun 12, 2012, at 11:32 PM, Matthew Johnson wrote:
>>
>>> Dear R-help,
>>>
>>> I have an xts data set that i have subset by date.
>>>
>>> now it contains a date-time-stamp, and two columns (price and volume
>>> traded): my objective is to create tables of volume traded at a  
>>> price - and
>>> i've been successfully using aggregate to do so in interactive use.
>>>
>>> say the data looks as follows:
>>>
>>>                   px_ym1 vol_ym1
>>> 2012-06-01 09:37:00  97.91     437
>>> 2012-06-01 09:37:00  97.91      64
>>> 2012-06-01 09:37:00  97.91       1
>>> 2012-06-01 09:37:00  97.91       5
>>> 2012-06-01 09:37:00  97.91       5
>>> 2012-06-01 09:37:00  97.92     174
>>> 2012-06-01 09:37:00  97.92      64
>>> 2012-06-01 09:37:00  97.92     125
>>> 2012-06-01 09:37:00  97.92     124
>>> 2012-06-01 09:37:00  97.92      64
>>> 2012-06-01 09:37:00  97.92     109
>>> 2012-06-01 09:37:00  97.92      64
>>> 2012-06-01 09:37:00  97.92      19
>>> 2012-06-01 09:37:00  97.92      45
>>> 2012-06-01 09:37:00  97.92      75
>>> 2012-06-01 09:37:00  97.92       3
>>> 2012-06-01 09:37:00  97.92      47
>>> 2012-06-01 09:37:00  97.91      26
>>> 2012-06-01 09:37:00  97.92       4
>>> 2012-06-01 09:37:00  97.92       1
>>>
>>> the the following gives me what i'm looking for:
>>>
>>>> adf <- aggregate(.~px_ym1, data=mm, sum)
>>>
>>>
>>> which is this table:
>>>
>>> px_ym1 vol_ym1
>>> 1  97.91     538
>>> 2  97.92     918
>>>
>>> however now i'm trying to code it to run automatically, and use of  
>>> the
>>> templated version:
>>>
>>>> adf <- aggregate(.~mm[,1], data=mm, sum)
>>
>>
>> Did you try:
>>
>> adf <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)
>> adf
>>
>> I would have used names:
>>
>> adf <- aggregate(vol_ym1 ~ px_ym1, data=mm, sum)
>> adf
>>
>>
>>
>>> yields the following - which contains what i'd like, but is has  
>>> also summed
>>> across the price column (not ideal).
>>>
>>>  px_ym1  px_ym1 vol_ym1
>>> 1  97.91  587.46     538
>>> 2  97.92 1370.88     918
>>>
>>> how do i code this so that i can enter an xts data-frame with  
>>> arbitrary
>>> names and still obtain the table with only the information i desire?
>>
>>
>> That is too far to the vague side of the vague-specific continuum.
>>
>>
>>>
>>> on a related point, is there a way to combine the two steps?
>>
>>
>> Er, which two steps would that be?
>>
>>
>>> the function
>>> i've written splits by date and then returns a list containing  
>>> data-frames
>>> that report the volume traded at each price on each date
>>>
>>> - am i re-creating the wheel here? is there canned function that  
>>> does this?
>>>
>>> thanks + best regards
>>>
>>> matt johnson
>>
>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>

David Winsemius, MD
West Hartford, CT