[R] question about using _apply and/or aggregate functions

David Winsemius dwinsemius at comcast.net
Tue Jun 23 01:55:45 CEST 2009


On Jun 22, 2009, at 6:16 PM, Clifford Long wrote:

> Hi David,
>
> I appreciate the advice.  I had coerced 'list4' with as.list, but forgot
> to wrap the 'by' argument in list() in the call to aggregate.  I made the
> correction, and now get the following:
>
>> slope.mult = simarray[,1]
>> adj.slope.value = simarray[,2]
>> adj.slope.level = simarray[,2]
>> qc.run.violation = simarray[,5]
>> simarray.part = cbind(slope.mult, adj.slope.value,  
>> qc.run.violation, adj.slope.level)
>> list4 = as.list(simarray.part[,4])
>> agg.result = aggregate(simarray.part[,3], by=list(list4), FUN = mean)
> Error in sort.list(unique.default(x), na.last = TRUE) :
>  'x' must be atomic for 'sort.list'
> Have you called 'sort' on a list?
>
> ... I'm not sure what this means that I've done wrong.  I did check
> 'list4' using "is.list", and get "TRUE" back as an answer, so feel
> that my mistake is some other fundamental aspect of R that I'm failing
> to grasp.
>
> To your note on 'tapply' ... I did try this as well (actually, tried
> it first) with no initial success.  On your recommendation, I gave
> tapply another go, and get something recognizable:
>
> vtt = tapply(simarray.part[,3], simarray.part[,2], mean)
>
>> dim(vtt)
> [1] 50
>> length(vtt)
> [1] 50
>> vtt[1:5]
> 0.003132 0.006264 0.009396 0.012528  0.01566
>    0.29     0.24     0.23     0.16     0.22
>> vtt[1]
> 0.003132
>    0.29
>> vtt[[1]][1]
> [1] 0.29
>
>
> I see that the output stored in "vtt" has one dimension with
> length=50.  But each place in vtt appears to hold two values.

Nope, that's just the output from an implicit print(vtt). vtt is a  
one-dimensional array: the 50 values are the means, and the second row  
you see is simply its names attribute (the group labels) printed above  
them. If you doubt me (and I had some trouble with this myself), try  
is.array(vtt) and I predict you will get TRUE.
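A minimal sketch with made-up numbers (toy stand-ins for simarray.part; the values are illustrative only) shows the structure:

```r
# Toy stand-ins for simarray.part[,3] (violation rates) and
# simarray.part[,2] (slope values) -- illustrative numbers only
x <- c(0.29, 0.24, 0.23, 0.16, 0.22)
g <- c(0.003132, 0.006264, 0.009396, 0.012528, 0.01566)

vtt <- tapply(x, g, mean)

is.array(vtt)    # TRUE: a one-dimensional array
is.matrix(vtt)   # FALSE: there is only one dimension
names(vtt)       # the group labels that print above the values
unname(vtt)      # the means themselves, without the labels
```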

>  I'll
> admit that I'm not sure how to access/reference the entirety of all 50
> values =  0.29  0.24  0.23  0.16  0.22 (and so on).  I don't appear to
> be able to access/reference only what appears to be an embedded index
> = 0.003132   0.006264   0.009396  etc.   What am I missing?

names(vtt)

>  Is there
> a reference that I need to re-read?

?tapply

"Value

When FUN is present, tapply calls FUN for each cell that has any data  
in it. If FUN returns a single atomic value for each such cell (e.g.,  
functions mean or var) and when simplify is TRUE, tapply returns a  
multi-way array containing the values, and NA for the empty cells."
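And for the record on the earlier sort.list error: the 'by' argument expects a list of atomic grouping vectors, so wrapping list4 (itself already a list) in list() hands aggregate a nested list. A minimal sketch with toy data (the names and numbers are illustrative, not from your simulation):

```r
x <- c(0.29, 0.24, 0.23, 0.16, 0.22)   # toy stand-in for simarray.part[,3]
g <- c(0.01, 0.01, 0.02, 0.02, 0.02)   # toy stand-in for simarray.part[,4]

# Reproduces the error: list(as.list(g)) is a list containing a list,
# and the grouping element handed to aggregate is not atomic
# aggregate(x, by = list(as.list(g)), FUN = mean)  # 'x' must be atomic

# Pass the atomic vector itself, wrapped once in list():
agg.result <- aggregate(x, by = list(slope = g), FUN = mean)
agg.result
```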


> I would like to be able to plot
> one against the other.

plot(as.numeric(names(vtt)), vtt)

(The names are stored as character, so convert them back to numeric  
for the x-axis.)
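Put together, with toy stand-ins for your columns (illustrative numbers and axis labels only):

```r
x <- c(0.29, 0.24, 0.23, 0.16, 0.22)                     # toy violation rates
g <- c(0.003132, 0.006264, 0.009396, 0.012528, 0.01566)  # toy slope shifts
vtt <- tapply(x, g, mean)

# names(vtt) is a character vector; recover the numeric grouping values
slopes <- as.numeric(names(vtt))

plot(slopes, as.numeric(vtt),
     xlab = "shift in slope", ylab = "P(runs-rule violation)")
```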

>
> Thanks again for taking the time outside of your "day job" for your
> earlier reply!
>
> Cliff
>
>
> On Mon, Jun 22, 2009 at 11:28 AM, David Winsemius  
> <dwinsemius at comcast.net> wrote:
>>
>> On Jun 22, 2009, at 12:04 PM, Clifford Long wrote:
>>
>>> Hi R-list,
>>>
>>> I'll apologize in advance for (1) the wordiness of my note (not sure
>>> how to avoid it) and (2) any deficiencies on my part that lead to my
>>> difficulties.
>>>
>>> I have an application with several stages that is meant to simulate
>>> and explore different scenarios with respect to product sales (in
>>> units sold per month).  My session info is at the bottom of this  
>>> note.
>>>
>>> The steps include (1) an initial linear regression model, (2) building
>>> an ARIMA model, (3) creating 12 months of simulated 'real' data - for
>>> various values of projected changes in slope from the linear
>>> regression - from the ARIMA model using arima.sim, with subsequent
>>> undifferencing and appending to historical data, (4) one-step-ahead
>>> forecasting for 12 months using the 'fixed' ARIMA model and simulated
>>> data, (5) taking the residuals from the forecasting (simulated minus
>>> forecast for each of the 12 months) and applying QC charting, and (6)
>>> counting the number (if any) of runs-rules violations detected.
>>>
>>> The simulation aspect runs 100 simulations for each of 50 values
>>> of slope.
>>>
>>> All of this seems to work fine (although I'm sure that the coding
>>> could be improved through functions, vectorization, etc. ... ).
>>> However, the problem I'm having is at the end where I had hoped that
>>> things would be easier.  I want to summarize and graph the  
>>> probability
>>> of detecting a runs-rule violation vs. the amount of the shift in
>>> slope (of logunit sales).
>>>
>>> The output data array passed from the qcc section at the end  
>>> includes:
>>>  - the adjustment made to the slope (a multiplier)
>>>  - the actual value of the slope
>>>  - the iteration number of the simulation loop (within each value of
>>> slope)
>>>  - the count of QC charting limits violations
>>>  - the count of QC charting runs rules violations
>>>
>>>
>>> My code is in the attached file ("generic_code.R") and my initial  
>>> sales
>>> data needed to "prime" the simulation is in the other attached file
>>> ("generic_data.csv").  The relevant section of my code is at the
>>> bottom of the .R file after the end of the outer loop.  I've tried  
>>> to
>>> provide meaningful comments.
>>>
>>> I've read the help files for _apply, aggregate, arrays and data  
>>> types,
>>> and have also consulted with several texts (Maindonald and Braun;
>>> Spector; Venables and Ripley for S-PLUS).  Somehow I still don't get
>>> it.  My attempts usually result in a message like the following:
>>>
>>>> agg.result = aggregate(simarray.part[,3], by=list4, FUN = mean)
>>>
>>> Error in FUN(X[[1L]], ...) : arguments must have same length
>>
>> I cannot comment on the overall strategy, but wondered if this  
>> minor mod to
>> the code might succeed:
>>
>>>> agg.result = aggregate(simarray.part[,3], by=list(list4), FUN =  
>>>> mean)
>>
>> My personal experience with aggregate() is not positive. I  
>> generally end up
>> turning to tapply()  (which is at the heart of aggregate() anyway)  
>> probably
>> because I forget to wrap the second argument in a list. Slow  
>> learner, I
>> guess.
>>
>>
>>>
>>> But when I check the length of the arguments, they appear to  
>>> match. (??)
>>>
>>>> length(simarray.part[,3])
>>>
>>> [1] 5000
>>>>
>>>> length(simarray.part[,4])
>>>
>>> [1] 5000
>>>>
>>>> length(list4)
>>>
>>> [1] 5000
>>>
>>>
>>> I would have rather passed along a subset of the simulation/loop
>>> output dataset, but was unsure how to save it to preserve whatever
>>> properties that I may have missed that are causing my difficulties.
>>> If anyone still wants to help at this point, I believe that you'll
>>> need to load the .csv data and run the simulation (maybe reducing  
>>> the
>>> number of iterations).
>>>
>>> Many thanks to anyone who can shed some light on my difficulties
>>> (whether self-induced or otherwise).
>>>
>>> Cliff
>>>
>>>
>>>
>>> I'm using a pre-compiled binary version of R for Windows.
>>>
>>> Session info:
>>>
>>>> sessionInfo()
>>>
>>> R version 2.9.0 (2009-04-17)
>>> i386-pc-mingw32
>>>
>>> locale:
>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>> States.1252;LC_MONETARY=English_United
>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] qcc_1.3         forecast_1.24   tseries_0.10-18 zoo_1.5-5
>>> [5] quadprog_1.4-11
>>>
>>> loaded via a namespace (and not attached):
>>> [1] grid_2.9.0      lattice_0.17-22
>>>
>>>
>>>> Sys.getlocale()
>>>
>>> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>> States.1252;LC_MONETARY=English_United
>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
>>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



