[Rd] [R] Calculation of cross-correlation in ccf
Martyn Plummer
plummerm at iarc.fr
Wed Nov 5 11:50:03 CET 2014
The acf and ccf functions assume that time series are stationary, but yours are not.
I think that your alternative function is not well founded. You take a separate mean for each sub-series, which implicitly allows the mean of the series to vary arbitrarily with time. However, you only have one instance of each time series so you can't have a non-parametric model for the means.
A parametric approach is to remove a linear trend from the series and then apply ccf:
e1 <- residuals(lm(y1 ~ x))
e2 <- residuals(lm(y1 ~ x))
ccf(e1, e2)
which does identify a lag of -8 as the best match, although the correlation is somewhat lower than what you found (0.87).
Martyn
________________________________________
From: r-devel-bounces at r-project.org [r-devel-bounces at r-project.org] on behalf of Sami Toppinen [sami.toppinen at kolumbus.fi]
Sent: 04 November 2014 09:44
To: r-devel at r-project.org
Subject: [Rd] [R] Calculation of cross-correlation in ccf
Dear All,
I am studying some process measurement time series in R and trying to identify time delays using cross-correlation function ccf. The results have however been bit confusing. I found a couple of years old message about this issue but unfortunately wasn't able to find it again for a reference.
For example, an obvious time shift is observed between the measurements y1 and y2 when the following test data is plotted:
x <- 1:121
y1 <- c(328.27, 328.27, 328.27, 328.27, 328.21, 328.14, 328.14, 328.01,
328.07, 328.01, 327.87, 328.01, 328.07, 328.27, 328.27, 328.54, 328.61,
328.74, 328.88, 329.01, 329.01, 329.21, 329.28, 329.35, 329.35, 329.42,
329.35, 329.28, 329.28, 329.15, 329.08, 329.08, 328.95, 328.95, 328.95,
328.95, 329.01, 329.15, 329.21, 329.28, 329.55, 329.62, 329.75, 329.82,
329.89, 330.09, 330.09, 330.29, 330.29, 330.36, 330.42, 330.29, 330.15,
330.22, 330.09, 329.95, 329.82, 329.75, 329.62, 329.55, 329.48, 329.55,
329.68, 329.75, 329.82, 329.89, 330.09, 330.09, 330.15, 330.22, 330.42,
330.42, 330.42, 330.36, 330.42, 330.22, 330.15, 330.09, 329.89, 329.75,
329.55, 329.35, 329.35, 329.42, 329.48, 329.55, 329.75, 329.75, 329.82,
330.09, 330.15, 330.42, 330.42, 330.62, 330.69, 330.69, 330.83, 330.83,
330.76, 330.62, 330.62, 330.56, 330.42, 330.42, 330.29, 330.29, 330.29,
330.29, 330.22, 330.49, 330.56, 330.62, 330.76, 331.03, 330.96, 331.16,
331.23, 331.50, 331.63, 332.03, 332.03)
y2 <- c(329.68, 329.75, 329.82, 329.95, 330.02, 330.15, 330.22, 330.36,
330.22, 330.29, 330.29, 330.29, 330.29, 330.15, 330.22, 330.22, 330.15,
330.15, 330.15, 330.15, 330.15, 330.29, 330.49, 330.49, 330.62, 330.89,
331.03, 331.09, 331.16, 331.30, 331.30, 331.36, 331.43, 331.43, 331.43,
331.36, 331.36, 331.36, 331.36, 331.23, 331.23, 331.16, 331.16, 331.23,
331.30, 331.23, 331.36, 331.56, 331.70, 331.83, 331.97, 331.97, 332.10,
332.30, 332.44, 332.44, 332.51, 332.51, 332.57, 332.57, 332.51, 332.37,
332.24, 332.24, 332.10, 331.97, 331.97, 331.90, 331.83, 331.97, 331.97,
331.97, 332.03, 332.24, 332.30, 332.30, 332.37, 332.57, 332.57, 332.57,
332.57, 332.57, 332.51, 332.37, 332.30, 332.17, 331.97, 331.83, 331.70,
331.70, 331.63, 331.63, 331.70, 331.83, 331.90, 332.10, 332.24, 332.30,
332.44, 332.57, 332.71, 332.84, 332.77, 332.91, 332.84, 332.84, 332.91,
332.84, 332.77, 332.77, 332.64, 332.64, 332.57, 332.57, 332.57, 332.57,
332.57, 332.71, 332.98, 333.24, 333.58)
matplot(cbind(y1, y2))
However, the cross-correlation function surprisingly gives the maximum correlation 0.83 at zero lag:
ccf(y1, y2)
Plotting of variables against each other with zero lag
plot(y1, y2)
shows that the correlation is not that good. Instead, a very nice correlation is obtained with a plot with shifted variables:
plot(y1[1:113], y2[1:113 + 8])
As a comparison I defined my own cross correlation function:
cross.cor <- function(x, y, k)
{
n <- length(x) - abs(k)
if (k >= 0)
cor(x[1:n + k], y[1:n])
else
cor(x[1:n], y[1:n - k])
}
The new function cross.cor gives maximum correlation of 0.99 at lag -8, and the shape of the correlation function is very different from the one obtained with ccf (both methods give same value at zero lag):
plot(-15:15, sapply(-15:15, function(lag) cross.cor(y1, y2, lag)),
ylim = c(0.3, 1))
points(-15:15, ccf(y1, y2, 15, plot = FALSE)$acf, col = "red")
In order to understand the reason for the differences, I looked at the program codes of ccf function. When the variables are compared with a nonzero lag, some the data points must be left out from the tails of the time series. It appears to me that ccf calculates (partly in R code, partly in C code) first the means and the standard deviations using the whole data, and then uses those values as constants in further calculations. The time series are truncated only in the summations for covariances.
That approach naturally speeds up the computations, but is it really correct? Is the approach used in ccf somehow statistically more correct? I suppose the strong increasing trend in my test data emphasizes this issue (leaving data points out from one end has big impact on the average).
Best regards,
Sami Toppinen
sami.toppinen at kolumbus.fi
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel-----------------------------------------------------------------------
This message and its attachments are strictly confidenti...{{dropped:8}}
More information about the R-devel
mailing list