[R] templated use of aggregate

Wed Jun 13 16:28:16 CEST 2012

thank you for your patience. i assure you i will get better with the
appropriate etiquette - and hopefully eventually contribute.

On 13 June 2012 16:18, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jun 13, 2012, at 10:09 AM, Matthew Johnson wrote:
>
>> my sessioninfo was as follows:
>>
>>> sessionInfo()
>>
>> R version 2.14.1 (2011-12-22)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods
>> [7] base
>>
>> other attached packages:
>> [1] xts_0.8-2 zoo_1.7-6
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.14.1    lattice_0.20-0
>>
>>
>> i have now updated to R 2.15, and my session info is:
>>
>>> sessionInfo()
>>
>> R version 2.15.0 (2012-03-30)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods
>> [7] base
>>
>> other attached packages:
>> [1] xts_0.8-6 zoo_1.7-7
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.15.0    lattice_0.20-6 tools_2.15.0
>>
>> For the XTS object mn your suggestion still fails with an error:
>>
>>> adf <- aggregate(mn[,-1]~mn[,1], data=mn, sum); adf
>>
>> Error in aggregate.formula(mn[, -1] ~ mn[, 1], data = mn, sum) :
>>  'names' attribute [3] must be the same length as the vector [1]
>>
>> however when i convert to a zoo with
>>
>>> mnz <- as.zoo(mn)
>>
>>
>> I get some errors, but it works
>
>
> Those are warnings, ... not errors.
>
>
>>
>>> adf <- aggregate(mnz[,-1]~mnz[,1], data=mnz, sum); adf
>>
>> Warning messages:
>> 1: In zoo(rval, index(x)[i]) :
>>  some methods for “zoo” objects do not work if the index entries in
>> ‘order.by’ are not unique
>> 2: In zoo(rval, index(x)[i]) :
>>  some methods for “zoo” objects do not work if the index entries in
>> ‘order.by’ are not unique
>> 3: In zoo(rval[i], index(x)[i]) :
>>  some methods for “zoo” objects do not work if the index entries in
>> ‘order.by’ are not unique
>> 4: In zoo(rval[i], index(x)[i]) :
>>  some methods for “zoo” objects do not work if the index entries in
>> ‘order.by’ are not unique
>> 5: In zoo(xc[ind], ix[ind]) :
>>  some methods for “zoo” objects do not work if the index entries in
>> ‘order.by’ are not unique
>> 6: In zoo(xc[ind], ix[ind]) :
>>  some methods for “zoo” objects do not work if the index entries in
>> ‘order.by’ are not unique
>>  mnz[, 1] mnz[, -1]
>> 1    97.90       408
>> 2    97.89       208
>>
>> So is this a bug in XTS?
>
>
> It does look that way to me. The correct way to report this is to contact
> the package maintainer (copied on this message) , (although I did notice
> that Joshua Ulrich already looked at this posting in SO and he is on the xts
> development team). You should have put in this at the beginning of your code
> :
>
> library(xts)
>
> --
> David.
>
>>
>> thanks for your patience
>>
>> mj
>>
>> On 13 June 2012 15:53, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>>
>>> On Jun 13, 2012, at 9:38 AM, Matthew Johnson wrote:
>>>
>>>> Sorry, i'll try and put more flesh on the bones.
>>>>
>>>> please note, i changed the data in the example, as fiddling has raised
>>>> another question that's best illustrated with a slightly different
>>>> data set.
>>>>
>>>> first of all, when i do as you suggest, i obtain the following error:
>>>>
>>>>> PxMat <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)
>>>>
>>>>
>>>>
>>>> Error in aggregate.formula(mm[, -1] ~ mm[, 1], data = mm, sum) :
>>>>  'names' attribute [3] must be the same length as the vector [1]
>>>
>>>
>>>
>>> Very strange. When I just did it with the structure you (cross-) posted
>>> on
>>> SO I got:
>>>
>>>> adf <- aggregate(mm[,-1]~mm[,1], data=mm, sum); adf
>>>
>>> snipped warning messages
>>>  mm[, 1] mm[, -1]
>>>
>>> 1   97.91      538
>>> 2   97.92      918
>>>
>>> I had earlier tested it with a zoo object I had constructed and did it
>>> again
>>> with the structure below.
>>>
>>>  mm[, 1] mm[, -1]
>>>
>>> 1   97.91      538
>>> 2   97.92      918
>>>
>>> I'm using zoo_1.7-6 and R version 2.14.2 on a Mac. I do not remember you
>>> posting the requested information about your versions.
>>>
>>> --
>>> David.
>>>
>>>
>>>>
>>>> my data.frame is an xts, and it looks like this:
>>>>
>>>>                   px_ym1 vol_ym1
>>>> 2012-06-01 09:30:00  97.90       9
>>>> 2012-06-01 09:30:00  97.90      60
>>>> 2012-06-01 09:30:00  97.90      71
>>>> 2012-06-01 09:30:00  97.90       5
>>>> 2012-06-01 09:30:00  97.90       3
>>>> 2012-06-01 09:30:00  97.90      21
>>>> 2012-06-01 09:31:00  97.90       5
>>>> 2012-06-01 09:31:00  97.89     192
>>>> 2012-06-01 09:31:00  97.89      65
>>>> 2012-06-01 09:31:00  97.89      73
>>>> 2012-06-01 09:31:00  97.89       1
>>>> 2012-06-01 09:31:00  97.89       1
>>>> 2012-06-01 09:31:00  97.89      39
>>>> 2012-06-01 09:31:00  97.90      15
>>>> 2012-06-01 09:31:00  97.90       1
>>>> 2012-06-01 09:31:00  97.89       1
>>>> 2012-06-01 09:31:00  97.90      18
>>>> 2012-06-01 09:31:00  97.89       1
>>>> 2012-06-01 09:32:00  97.89      33
>>>> 2012-06-01 09:34:00  97.89       1
>>>> 2012-06-01 09:34:00  97.89       1
>>>>
>>>> dput(mn) returns:
>>>>
>>>>> dput(mn)
>>>>
>>>>
>>>> structure(c(97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.89,
>>>> 97.89, 97.89, 97.89, 97.89, 97.89, 97.9, 97.9, 97.89, 97.9, 97.89,
>>>> 97.89, 97.89, 97.89, 9, 60, 71, 5, 3, 21, 5, 192, 65, 73, 1,
>>>> 1, 39, 15, 1, 1, 18, 1, 33, 1, 1), .indexCLASS = c("POSIXct",
>>>> "POSIXt"), .indexTZ = "GMT", class = c("xts", "zoo"), index =
>>>> structure(c(1338543000,
>>>> 1338543000, 1338543000, 1338543000, 1338543000, 1338543000, 1338543060,
>>>> 1338543060, 1338543060, 1338543060, 1338543060, 1338543060, 1338543060,
>>>> 1338543060, 1338543060, 1338543060, 1338543060, 1338543060, 1338543120,
>>>> 1338543240, 1338543240), tzone = "GMT", tclass = c("POSIXct",
>>>> "POSIXt")), .Dim = c(21L, 2L), .Dimnames = list(NULL, c("px_ym1",
>>>> "vol_ym1")))
>>>>
>>>> as you can see, the xts data.frame xts data.frame that contains dates,
>>>> prices and volumes. There is much more data over a long time period,
>>>> and i'm interested in various sub-setting and then aggregate
>>>> operations.
>>>>
>>>> I would like to split the data by time period and aggregate the data,
>>>> such that i obtain a table which reports the volume traded at each
>>>> price, for each of the time-period splits that i have chosen.
>>>>
>>>> I have employed the following approach:
>>>>
>>>> PxMat <- aggregate(.~px_ym1, data=mn, sum)
>>>>
>>>>
>>>> which yields:
>>>>
>>>>  px_ym1 vol_ym1
>>>> 1  97.89     408
>>>> 2  97.90     208
>>>>
>>>> and for subsets, i use the following grouping:
>>>>
>>>>> PxMat30 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '30'], sum)
>>>>
>>>>
>>>>
>>>> Which yields:
>>>>
>>>>  px_ym1 vol_ym1
>>>> 1   97.9     169
>>>>
>>>> and
>>>>
>>>>> PxMat31 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '31'], sum)
>>>>
>>>>
>>>>
>>>> which yields:
>>>>
>>>>  px_ym1 vol_ym1
>>>> 1  97.89     373
>>>> 2  97.90      39
>>>>
>>>> and so on and so forth for each minute.
>>>>
>>>> when i try and sub-set using general notation, as follows:
>>>>
>>>> PxMat <- aggregate(.~mn[,1], data=mn, sum)
>>>>
>>>> this yields a different form of output:
>>>>
>>>> px_ym1  px_ym1 vol_ym1
>>>> 1  97.90 1076.79     408
>>>> 2  97.89  979.00     208
>>>>
>>>> the problem is that i now have the sum of the px_ym1 data (the sum of
>>>> mn[,1])
>>>>
>>>> hopefully things are now clearer - sorry to have wasted your time up
>>>> until now.
>>>>
>>>> assuming that i have now made my situation clear, i am hope you can
>>>> help with four specific questions.
>>>>
>>>> 1/ My data-sets are HUGE, so speed is an issue - is this the fastest
>>>> way to sub-set and aggregate an xts?
>>>>
>>>> 2/ is there a way to do this for multiple splits? say a table for each
>>>> minute, day, week, or month? the return would potentially be a list
>>>> with a table for each day / minute etc showing volume traded at each
>>>> price -- but it doesn't have to be a list ...
>>>>
>>>> i am writing a function with loops that would generate a table that
>>>> reports volume traded at each price for each case of a specified time
>>>> split (say for four tables, one for each minute in the example data,
>>>> returned as a list). my solution is slow, it seems like something that
>>>> someone would have done better already. is this the case?
>>>>
>>>> 3/ is there a way to do the sub-setting with templated variables? i
>>>> would like to obtain the table i get with the named aggregate
>>>> functions (reproduced above) with multiple data frames, as the column
>>>> names will differ from time to time. i cannot figure out how to stop
>>>> the command from summing the mn[,1] column when i stop using variable
>>>> names.
>>>>
>>>> 4/ on a related note, is it possible to apply different functions to
>>>> different columns of data? It would be nice, for example, if the table
>>>> returned from an aggregate command could be made to be:
>>>>
>>>> px_ym1  count vol_ym1
>>>> 1  97.90  11     408
>>>> 2  97.89  10     208
>>>>
>>>> where we have the price traded, the number of trades (a count of
>>>> px_ym1 / mn[,1], and the sum of vol_ym1 (mn[,2]).
>>>>
>>>> thanks and best regards
>>>>
>>>> matt johnson
>>>>
>>>> On 13 June 2012 15:06, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Jun 12, 2012, at 11:32 PM, Matthew Johnson wrote:
>>>>>
>>>>>> Dear R-help,
>>>>>>
>>>>>> I have an xts data set that i have subset by date.
>>>>>>
>>>>>> now it contains a date-time-stamp, and two columns (price and volume
>>>>>> traded): my objective is to create tables of volume traded at a price
>>>>>> -
>>>>>> and
>>>>>> i've been successfully using aggregate to do so in interactive use.
>>>>>>
>>>>>> say the data looks as follows:
>>>>>>
>>>>>>                 px_ym1 vol_ym1
>>>>>> 2012-06-01 09:37:00  97.91     437
>>>>>> 2012-06-01 09:37:00  97.91      64
>>>>>> 2012-06-01 09:37:00  97.91       1
>>>>>> 2012-06-01 09:37:00  97.91       5
>>>>>> 2012-06-01 09:37:00  97.91       5
>>>>>> 2012-06-01 09:37:00  97.92     174
>>>>>> 2012-06-01 09:37:00  97.92      64
>>>>>> 2012-06-01 09:37:00  97.92     125
>>>>>> 2012-06-01 09:37:00  97.92     124
>>>>>> 2012-06-01 09:37:00  97.92      64
>>>>>> 2012-06-01 09:37:00  97.92     109
>>>>>> 2012-06-01 09:37:00  97.92      64
>>>>>> 2012-06-01 09:37:00  97.92      19
>>>>>> 2012-06-01 09:37:00  97.92      45
>>>>>> 2012-06-01 09:37:00  97.92      75
>>>>>> 2012-06-01 09:37:00  97.92       3
>>>>>> 2012-06-01 09:37:00  97.92      47
>>>>>> 2012-06-01 09:37:00  97.91      26
>>>>>> 2012-06-01 09:37:00  97.92       4
>>>>>> 2012-06-01 09:37:00  97.92       1
>>>>>>
>>>>>> the the following gives me what i'm looking for:
>>>>>>
>>>>>>> adf <- aggregate(.~px_ym1, data=mm, sum)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> which is this table:
>>>>>>
>>>>>> px_ym1 vol_ym1
>>>>>> 1  97.91     538
>>>>>> 2  97.92     918
>>>>>>
>>>>>> however now i'm trying to code it to run automatically, and use of the
>>>>>> templated version:
>>>>>>
>>>>>>> adf <- aggregate(.~mm[,1], data=mm, sum)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Did you try:
>>>>>
>>>>> adf <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)
>>>>> adf
>>>>>
>>>>> I would have used names:
>>>>>
>>>>> adf <- aggregate(vol_ym1 ~ px_ym1, data=mm, sum)
>>>>> adf
>>>>>
>>>>>
>>>>>
>>>>>> yields the following - which contains what i'd like, but is has also
>>>>>> summed
>>>>>> across the price column (not ideal).
>>>>>>
>>>>>>  px_ym1  px_ym1 vol_ym1
>>>>>> 1  97.91  587.46     538
>>>>>> 2  97.92 1370.88     918
>>>>>>
>>>>>> how do i code this so that i can enter an xts data-frame with
>>>>>> arbitrary
>>>>>> names and still obtain the table with only the information i desire?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> That is too far to the vague side of the vague-specific continuum.
>>>>>
>>>>>
>>>>>>
>>>>>> on a related point, is there a way to combine the two steps?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Er, which two steps would that be?
>>>>>
>>>>>
>>>>>> the function
>>>>>> i've written splits by date and then returns a list containing
>>>>>> data-frames
>>>>>> that report the volume traded at each price on each date
>>>>>>
>>>>>> - am i re-creating the wheel here? is there canned function that does
>>>>>> this?
>>>>>>
>>>>>> thanks + best regards
>>>>>>
>>>>>> matt johnson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> David Winsemius, MD
>>>>> West Hartford, CT
>>>>>
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>
> David Winsemius, MD
> West Hartford, CT
>