[Rd] [R] Calculation of cross-correlation in ccf

Tue Nov 4 09:44:36 CET 2014

Dear All,

I am studying some process measurement time series in R and trying to identify time delays using cross-correlation function ccf. The results have however been bit confusing. I found a couple of years old message about this issue but unfortunately wasn't able to find it again for a reference.

For example, an obvious time shift is observed between the measurements y1 and y2 when the following test data is plotted:

x <- 1:121
y1 <- c(328.27, 328.27, 328.27, 328.27, 328.21, 328.14, 328.14, 328.01,
  328.07, 328.01, 327.87, 328.01, 328.07, 328.27, 328.27, 328.54, 328.61,
  328.74, 328.88, 329.01, 329.01, 329.21, 329.28, 329.35, 329.35, 329.42,
  329.35, 329.28, 329.28, 329.15, 329.08, 329.08, 328.95, 328.95, 328.95,
  328.95, 329.01, 329.15, 329.21, 329.28, 329.55, 329.62, 329.75, 329.82,
  329.89, 330.09, 330.09, 330.29, 330.29, 330.36, 330.42, 330.29, 330.15,
  330.22, 330.09, 329.95, 329.82, 329.75, 329.62, 329.55, 329.48, 329.55,
  329.68, 329.75, 329.82, 329.89, 330.09, 330.09, 330.15, 330.22, 330.42,
  330.42, 330.42, 330.36, 330.42, 330.22, 330.15, 330.09, 329.89, 329.75,
  329.55, 329.35, 329.35, 329.42, 329.48, 329.55, 329.75, 329.75, 329.82,
  330.09, 330.15, 330.42, 330.42, 330.62, 330.69, 330.69, 330.83, 330.83,
  330.76, 330.62, 330.62, 330.56, 330.42, 330.42, 330.29, 330.29, 330.29,
  330.29, 330.22, 330.49, 330.56, 330.62, 330.76, 331.03, 330.96, 331.16,
  331.23, 331.50, 331.63, 332.03, 332.03)
y2 <- c(329.68, 329.75, 329.82, 329.95, 330.02, 330.15, 330.22, 330.36,
  330.22, 330.29, 330.29, 330.29, 330.29, 330.15, 330.22, 330.22, 330.15,
  330.15, 330.15, 330.15, 330.15, 330.29, 330.49, 330.49, 330.62, 330.89,
  331.03, 331.09, 331.16, 331.30, 331.30, 331.36, 331.43, 331.43, 331.43,
  331.36, 331.36, 331.36, 331.36, 331.23, 331.23, 331.16, 331.16, 331.23,
  331.30, 331.23, 331.36, 331.56, 331.70, 331.83, 331.97, 331.97, 332.10,
  332.30, 332.44, 332.44, 332.51, 332.51, 332.57, 332.57, 332.51, 332.37,
  332.24, 332.24, 332.10, 331.97, 331.97, 331.90, 331.83, 331.97, 331.97,
  331.97, 332.03, 332.24, 332.30, 332.30, 332.37, 332.57, 332.57, 332.57,
  332.57, 332.57, 332.51, 332.37, 332.30, 332.17, 331.97, 331.83, 331.70,
  331.70, 331.63, 331.63, 331.70, 331.83, 331.90, 332.10, 332.24, 332.30,
  332.44, 332.57, 332.71, 332.84, 332.77, 332.91, 332.84, 332.84, 332.91,
  332.84, 332.77, 332.77, 332.64, 332.64, 332.57, 332.57, 332.57, 332.57,
  332.57, 332.71, 332.98, 333.24, 333.58)
matplot(cbind(y1, y2))

However, the cross-correlation function surprisingly gives the maximum correlation 0.83 at zero lag:

ccf(y1, y2)

Plotting of variables against each other with zero lag

plot(y1, y2)

shows that the correlation is not that good. Instead, a very nice correlation is obtained with a plot with shifted variables:

plot(y1[1:113], y2[1:113 + 8])

As a comparison I defined my own cross correlation function:

cross.cor <- function(x, y, k)
{
  n <- length(x) - abs(k)
  if (k >= 0)
    cor(x[1:n + k], y[1:n])
  else
    cor(x[1:n], y[1:n - k])
}

The new function cross.cor gives maximum correlation of 0.99 at lag -8, and the shape of the correlation function is very different from the one obtained with ccf (both methods give same value at zero lag):

plot(-15:15, sapply(-15:15, function(lag) cross.cor(y1, y2, lag)),
  ylim = c(0.3, 1))
points(-15:15, ccf(y1, y2, 15, plot = FALSE)$acf, col = "red")

In order to understand the reason for the differences, I looked at the program codes of ccf function. When the variables are compared with a nonzero lag, some the data points must be left out from the tails of the time series. It appears to me that ccf calculates (partly in R code, partly in C code) first the means and the standard deviations using the whole data, and then uses those values as constants in further calculations. The time series are truncated only in the summations for covariances.

That approach naturally speeds up the computations, but is it really correct? Is the approach used in ccf somehow statistically more correct? I suppose the strong increasing trend in my test data emphasizes this issue (leaving data points out from one end has big impact on the average).

Best regards,
Sami Toppinen
sami.toppinen at kolumbus.fi