[R] Possible artifacts in cross-correlation function ("ccf")?

Tim Dorscheidt tdorscheidt at gmail.com
Fri May 11 13:06:03 CEST 2012


Dear R-users,

I have been using R and its core packages with great satisfaction for many years, and have recently started using the "ccf" function (part of the "stats" package, version 2.16.0), about which I have a question.

The "ccf" algorithm for calculating the cross-correlation between two time series computes the mean and standard deviation of each series once, beforehand, and then uses those constant values irrespective of the time-lag. Another piece of statistical software that I'm using, a toolbox in Matlab, does this in a fundamentally different way. It first "chops off" the parts of the time series that no longer overlap once a time-lag has been introduced, and then calculates a new mean and standard deviation from the overlapping parts only, to be used for the further calculations. This latter method has the advantage that a cross-correlation of 1 (or -1) can in principle still be obtained at any lag, whereas the "ccf" method of R seems to introduce zeros at the non-overlapping parts of the time series, thereby preventing this possibility and producing very different results. Take for instance the two time series a = {1,3,2} and b = {3,2,1}. The query "ccf(a,b)" produces the output {-0.5, -0.5, 0.5}, but I would think that a time-lag of -1 should produce a cross-correlation of 1 here, since the two time series then overlap with the identical parts {3,2}.
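
To make the difference concrete, here is a minimal sketch in R of the two approaches as I understand them. The function names "ccf_fixed" and "ccf_overlap" are made up for illustration and are not part of the "stats" package, and the sketch is not the actual "ccf" source. It assumes the lag convention documented for "ccf", in which lag k pairs x[t+k] with y[t].

```r
# Hypothetical helpers for illustration only; not the stats::ccf source.

# Fixed-moment estimator: mean and variance are computed once on the full
# series, and the sum over the overlapping pairs is divided by the full length n.
ccf_fixed <- function(x, y, lag) {
  n  <- length(x)
  dx <- x - mean(x)
  dy <- y - mean(y)
  t  <- seq_len(n)
  t  <- t[(t + lag) >= 1 & (t + lag) <= n]   # keep t for which x[t + lag] exists
  (sum(dx[t + lag] * dy[t]) / n) / sqrt(mean(dx^2) * mean(dy^2))
}

# Overlap-only estimator: drop the non-overlapping ends, then compute an
# ordinary Pearson correlation (new mean and sd) on what remains.
ccf_overlap <- function(x, y, lag) {
  n <- length(x)
  t <- seq_len(n)
  t <- t[(t + lag) >= 1 & (t + lag) <= n]
  cor(x[t + lag], y[t])
}

a <- c(1, 3, 2)
b <- c(3, 2, 1)
lags <- -1:1

fixed   <- sapply(lags, function(k) ccf_fixed(a, b, k))
overlap <- sapply(lags, function(k) ccf_overlap(a, b, k))
r_ccf   <- as.vector(ccf(a, b, lag.max = 1, plot = FALSE)$acf)

print(rbind(lag = lags, fixed = fixed, overlap = overlap, r_ccf = r_ccf))
```

With a and b as above, "ccf_fixed" gives -0.5, -0.5, 0.5 for lags -1, 0, 1 (the same values as "ccf"), while "ccf_overlap" gives -1, -0.5, 1. Under the convention assumed in this sketch the identical overlap {3,2} sits at lag +1; a tool that defines the lag with the opposite sign would report it at lag -1.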

I have attached clean implementations (removing all dependencies) of how the R algorithm seems to calculate cross-correlations with time-lag (it produces identical results to "ccf"), and how this other method (in Matlab) calculates it (with newly calculated means and standard deviation for each time-lag).

Could someone be so kind as to explain to me why the "ccf" algorithm is implemented in this specific way, which seems, at least in some situations, to produce results with artifacts? It is very likely that the R implementation, as opposed to the alternative algorithm described above and in the attachment, has a very good statistical justification, but one that unfortunately is not dawning on me.

Sincerely,
Tim Dorscheidt




