[R] templated use of aggregate

Wed Jun 13 16:09:27 CEST 2012

Sorry about the cross posting - i didn't realise it was bad etiquette.

my sessioninfo was as follows:

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] xts_0.8-2 zoo_1.7-6

loaded via a namespace (and not attached):
[1] grid_2.14.1    lattice_0.20-0

i have now updated to R 2.15, and my session info is:

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] xts_0.8-6 zoo_1.7-7

loaded via a namespace (and not attached):
[1] grid_2.15.0    lattice_0.20-6 tools_2.15.0

For the XTS object mn your suggestion still fails with an error:

> adf <- aggregate(mn[,-1]~mn[,1], data=mn, sum); adf
Error in aggregate.formula(mn[, -1] ~ mn[, 1], data = mn, sum) :
  'names' attribute [3] must be the same length as the vector [1]

however when i convert to a zoo with

> mnz <- as.zoo(mn)

I get some errors, but it works

> adf <- aggregate(mnz[,-1]~mnz[,1], data=mnz, sum); adf
Warning messages:
1: In zoo(rval, index(x)[i]) :
  some methods for “zoo” objects do not work if the index entries in
‘order.by’ are not unique
2: In zoo(rval, index(x)[i]) :
  some methods for “zoo” objects do not work if the index entries in
‘order.by’ are not unique
3: In zoo(rval[i], index(x)[i]) :
  some methods for “zoo” objects do not work if the index entries in
‘order.by’ are not unique
4: In zoo(rval[i], index(x)[i]) :
  some methods for “zoo” objects do not work if the index entries in
‘order.by’ are not unique
5: In zoo(xc[ind], ix[ind]) :
  some methods for “zoo” objects do not work if the index entries in
‘order.by’ are not unique
6: In zoo(xc[ind], ix[ind]) :
  some methods for “zoo” objects do not work if the index entries in
‘order.by’ are not unique
  mnz[, 1] mnz[, -1]
1    97.90       408
2    97.89       208

So is this a bug in XTS?

thanks for your patience

mj

On 13 June 2012 15:53, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jun 13, 2012, at 9:38 AM, Matthew Johnson wrote:
>
>> Sorry, i'll try and put more flesh on the bones.
>>
>> please note, i changed the data in the example, as fiddling has raised
>> another question that's best illustrated with a slightly different
>> data set.
>>
>> first of all, when i do as you suggest, i obtain the following error:
>>
>>> PxMat <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)
>>
>>
>> Error in aggregate.formula(mm[, -1] ~ mm[, 1], data = mm, sum) :
>>  'names' attribute [3] must be the same length as the vector [1]
>
>
> Very strange. When I just did it with the structure you (cross-) posted on
> SO I got:
>
>> adf <- aggregate(mm[,-1]~mm[,1], data=mm, sum); adf
> snipped warning messages
>  mm[, 1] mm[, -1]
>
> 1   97.91      538
> 2   97.92      918
>
> I had earlier tested it with a zoo object I had constructed and did it again
> with the structure below.
>
>  mm[, 1] mm[, -1]
>
> 1   97.91      538
> 2   97.92      918
>
> I'm using zoo_1.7-6 and R version 2.14.2 on a Mac. I do not remember you
> posting the requested information about your versions.
>
> --
> David.
>
>
>>
>> my data.frame is an xts, and it looks like this:
>>
>>                    px_ym1 vol_ym1
>> 2012-06-01 09:30:00  97.90       9
>> 2012-06-01 09:30:00  97.90      60
>> 2012-06-01 09:30:00  97.90      71
>> 2012-06-01 09:30:00  97.90       5
>> 2012-06-01 09:30:00  97.90       3
>> 2012-06-01 09:30:00  97.90      21
>> 2012-06-01 09:31:00  97.90       5
>> 2012-06-01 09:31:00  97.89     192
>> 2012-06-01 09:31:00  97.89      65
>> 2012-06-01 09:31:00  97.89      73
>> 2012-06-01 09:31:00  97.89       1
>> 2012-06-01 09:31:00  97.89       1
>> 2012-06-01 09:31:00  97.89      39
>> 2012-06-01 09:31:00  97.90      15
>> 2012-06-01 09:31:00  97.90       1
>> 2012-06-01 09:31:00  97.89       1
>> 2012-06-01 09:31:00  97.90      18
>> 2012-06-01 09:31:00  97.89       1
>> 2012-06-01 09:32:00  97.89      33
>> 2012-06-01 09:34:00  97.89       1
>> 2012-06-01 09:34:00  97.89       1
>>
>> dput(mn) returns:
>>
>>> dput(mn)
>>
>> structure(c(97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.89,
>> 97.89, 97.89, 97.89, 97.89, 97.89, 97.9, 97.9, 97.89, 97.9, 97.89,
>> 97.89, 97.89, 97.89, 9, 60, 71, 5, 3, 21, 5, 192, 65, 73, 1,
>> 1, 39, 15, 1, 1, 18, 1, 33, 1, 1), .indexCLASS = c("POSIXct",
>> "POSIXt"), .indexTZ = "GMT", class = c("xts", "zoo"), index =
>> structure(c(1338543000,
>> 1338543000, 1338543000, 1338543000, 1338543000, 1338543000, 1338543060,
>> 1338543060, 1338543060, 1338543060, 1338543060, 1338543060, 1338543060,
>> 1338543060, 1338543060, 1338543060, 1338543060, 1338543060, 1338543120,
>> 1338543240, 1338543240), tzone = "GMT", tclass = c("POSIXct",
>> "POSIXt")), .Dim = c(21L, 2L), .Dimnames = list(NULL, c("px_ym1",
>> "vol_ym1")))
>>
>> as you can see, the xts data.frame xts data.frame that contains dates,
>> prices and volumes. There is much more data over a long time period,
>> and i'm interested in various sub-setting and then aggregate
>> operations.
>>
>> I would like to split the data by time period and aggregate the data,
>> such that i obtain a table which reports the volume traded at each
>> price, for each of the time-period splits that i have chosen.
>>
>> I have employed the following approach:
>>
>> PxMat <- aggregate(.~px_ym1, data=mn, sum)
>>
>>
>> which yields:
>>
>>  px_ym1 vol_ym1
>> 1  97.89     408
>> 2  97.90     208
>>
>> and for subsets, i use the following grouping:
>>
>>> PxMat30 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '30'], sum)
>>
>>
>> Which yields:
>>
>>  px_ym1 vol_ym1
>> 1   97.9     169
>>
>> and
>>
>>> PxMat31 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '31'], sum)
>>
>>
>> which yields:
>>
>>  px_ym1 vol_ym1
>> 1  97.89     373
>> 2  97.90      39
>>
>> and so on and so forth for each minute.
>>
>> when i try and sub-set using general notation, as follows:
>>
>> PxMat <- aggregate(.~mn[,1], data=mn, sum)
>>
>> this yields a different form of output:
>>
>> px_ym1  px_ym1 vol_ym1
>> 1  97.90 1076.79     408
>> 2  97.89  979.00     208
>>
>> the problem is that i now have the sum of the px_ym1 data (the sum of
>> mn[,1])
>>
>> hopefully things are now clearer - sorry to have wasted your time up
>> until now.
>>
>> assuming that i have now made my situation clear, i am hope you can
>> help with four specific questions.
>>
>> 1/ My data-sets are HUGE, so speed is an issue - is this the fastest
>> way to sub-set and aggregate an xts?
>>
>> 2/ is there a way to do this for multiple splits? say a table for each
>> minute, day, week, or month? the return would potentially be a list
>> with a table for each day / minute etc showing volume traded at each
>> price -- but it doesn't have to be a list ...
>>
>> i am writing a function with loops that would generate a table that
>> reports volume traded at each price for each case of a specified time
>> split (say for four tables, one for each minute in the example data,
>> returned as a list). my solution is slow, it seems like something that
>> someone would have done better already. is this the case?
>>
>> 3/ is there a way to do the sub-setting with templated variables? i
>> would like to obtain the table i get with the named aggregate
>> functions (reproduced above) with multiple data frames, as the column
>> names will differ from time to time. i cannot figure out how to stop
>> the command from summing the mn[,1] column when i stop using variable
>> names.
>>
>> 4/ on a related note, is it possible to apply different functions to
>> different columns of data? It would be nice, for example, if the table
>> returned from an aggregate command could be made to be:
>>
>> px_ym1  count vol_ym1
>> 1  97.90  11     408
>> 2  97.89  10     208
>>
>> where we have the price traded, the number of trades (a count of
>> px_ym1 / mn[,1], and the sum of vol_ym1 (mn[,2]).
>>
>> thanks and best regards
>>
>> matt johnson
>>
>> On 13 June 2012 15:06, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>>
>>>
>>> On Jun 12, 2012, at 11:32 PM, Matthew Johnson wrote:
>>>
>>>> Dear R-help,
>>>>
>>>> I have an xts data set that i have subset by date.
>>>>
>>>> now it contains a date-time-stamp, and two columns (price and volume
>>>> traded): my objective is to create tables of volume traded at a price -
>>>> and
>>>> i've been successfully using aggregate to do so in interactive use.
>>>>
>>>> say the data looks as follows:
>>>>
>>>>                  px_ym1 vol_ym1
>>>> 2012-06-01 09:37:00  97.91     437
>>>> 2012-06-01 09:37:00  97.91      64
>>>> 2012-06-01 09:37:00  97.91       1
>>>> 2012-06-01 09:37:00  97.91       5
>>>> 2012-06-01 09:37:00  97.91       5
>>>> 2012-06-01 09:37:00  97.92     174
>>>> 2012-06-01 09:37:00  97.92      64
>>>> 2012-06-01 09:37:00  97.92     125
>>>> 2012-06-01 09:37:00  97.92     124
>>>> 2012-06-01 09:37:00  97.92      64
>>>> 2012-06-01 09:37:00  97.92     109
>>>> 2012-06-01 09:37:00  97.92      64
>>>> 2012-06-01 09:37:00  97.92      19
>>>> 2012-06-01 09:37:00  97.92      45
>>>> 2012-06-01 09:37:00  97.92      75
>>>> 2012-06-01 09:37:00  97.92       3
>>>> 2012-06-01 09:37:00  97.92      47
>>>> 2012-06-01 09:37:00  97.91      26
>>>> 2012-06-01 09:37:00  97.92       4
>>>> 2012-06-01 09:37:00  97.92       1
>>>>
>>>> the the following gives me what i'm looking for:
>>>>
>>>>> adf <- aggregate(.~px_ym1, data=mm, sum)
>>>>
>>>>
>>>>
>>>> which is this table:
>>>>
>>>> px_ym1 vol_ym1
>>>> 1  97.91     538
>>>> 2  97.92     918
>>>>
>>>> however now i'm trying to code it to run automatically, and use of the
>>>> templated version:
>>>>
>>>>> adf <- aggregate(.~mm[,1], data=mm, sum)
>>>
>>>
>>>
>>> Did you try:
>>>
>>> adf <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)
>>> adf
>>>
>>> I would have used names:
>>>
>>> adf <- aggregate(vol_ym1 ~ px_ym1, data=mm, sum)
>>> adf
>>>
>>>
>>>
>>>> yields the following - which contains what i'd like, but is has also
>>>> summed
>>>> across the price column (not ideal).
>>>>
>>>>  px_ym1  px_ym1 vol_ym1
>>>> 1  97.91  587.46     538
>>>> 2  97.92 1370.88     918
>>>>
>>>> how do i code this so that i can enter an xts data-frame with arbitrary
>>>> names and still obtain the table with only the information i desire?
>>>
>>>
>>>
>>> That is too far to the vague side of the vague-specific continuum.
>>>
>>>
>>>>
>>>> on a related point, is there a way to combine the two steps?
>>>
>>>
>>>
>>> Er, which two steps would that be?
>>>
>>>
>>>> the function
>>>> i've written splits by date and then returns a list containing
>>>> data-frames
>>>> that report the volume traded at each price on each date
>>>>
>>>> - am i re-creating the wheel here? is there canned function that does
>>>> this?
>>>>
>>>> thanks + best regards
>>>>
>>>> matt johnson
>>>
>>>
>>>
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>
> David Winsemius, MD
> West Hartford, CT
>