[R] question about using _apply and/or aggregate functions

Mon Jun 22 18:28:50 CEST 2009

On Jun 22, 2009, at 12:04 PM, Clifford Long wrote:

> Hi R-list,
>
> I'll apologize in advance for (1) the wordiness of my note (not sure
> how to avoid it) and (2) any deficiencies on my part that lead to my
> difficulties.
>
> I have an application with several stages that is meant to simulate
> and explore different scenarios with respect to product sales (in
> units sold per month).  My session info is at the bottom of this note.
>
> The steps include (1) an initial linear regression model, (2) building
> an ARIMA model, (3) creating 12 months of simulated 'real' data - for
> various values of projected changes in slope from the linear
> regression - from the ARIMA model using arima.sim with subsequent
> undifferencing and appending to historical data, (3) one-step-ahead
> forecasting for 12 months using the 'fixed' arima model and simulated
> data, (4) taking the residuals from the forecasting (simulated minus
> forecast for each of the 12 months) and applying QC charting, and (5)
> counting the number (if any) of runs-rules violations detected.
>
> The simulation-aspect calculates 100 simulations for each of 50  
> values of slope.
>
> All of this seems to work fine (although I'm sure that the coding
> could be improved through functions, vectorization, etc. ... ).
> However, the problem I'm having is at the end where I had hoped that
> things would be easier.  I want to summarize and graph the probability
> of detecting a runs-rule violation vs. the amount of the shift in
> slope (of logunit sales).
>
> The output data array passed from the qcc section at the end includes:
>  - the adjustment made to the slope (a multiplier)
>  - the actual value of the slope
>  - the iteration number of the simulation loop (within each value of  
> slope)
>  - the count of QC charting limits violations
>  - the count of QC charting runs rules violations
>
>
> My code is in the attached file ("generic_code.R) and my initial sales
> data needed to "prime" the simulation is in the other attached file
> ("generic_data.csv").  The relevant section of my code is at the
> bottom of the .R file after the end of the outer loop.  I've tried to
> provide meaningful comments.
>
> I've read the help files for _apply, aggregate, arrays and data types,
> and have also consulted with several texts (Maindonald and Braun;
> Spector; Venebles and Ripley for S-plus).  Somehow I still don't get
> it.  My attempts usually result in a message like the following:
>
>> agg.result = aggregate(simarray.part[,3], by=list4, FUN = mean)
> Error in FUN(X[[1L]], ...) : arguments must have same length

I cannot comment on the overall strategy, but wondered if this minor  
mod to the code might succeed;

>> agg.result = aggregate(simarray.part[,3], by=list(list4), FUN = mean)

My personal experience with aggregate() is not positive. I generally  
end up turning to tapply()  (which is at the heart of aggregate()  
anyway) probably because I forget to wrap the second argument in a  
list. Slow learner, I guess.

>
> But when I check the length of the arguments, they appear to match.  
> (??)
>
>> length(simarray.part[,3])
> [1] 5000
>> length(simarray.part[,4])
> [1] 5000
>> length(list4)
> [1] 5000
>
>
> I would have rather passed along a subset of the simulation/loop
> output dataset, but was unsure how to save it to preserve whatever
> properties that I may have missed that are causing my difficulties.
> If anyone still wants to help at this point, I believe that you'll
> need to load the .csv data and run the simulation (maybe reducing the
> number of iterations).
>
> Many thanks to anyone who can shed some light on my difficulties
> (whether self-induced or otherwise).
>
> Cliff
>
>
>
> I'm using a pre-compiled binary version of R for Windows.
>
> Session info:
>
>> sessionInfo()
> R version 2.9.0 (2009-04-17)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] qcc_1.3         forecast_1.24   tseries_0.10-18 zoo_1.5-5
> [5] quadprog_1.4-11
>
> loaded via a namespace (and not attached):
> [1] grid_2.9.0      lattice_0.17-22
>
>
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT