[R] Number of used observations in stats::arima with NAs

Wed Nov 2 22:48:00 CET 2016

When missing observations (NAs) are present in the data and the regular or differencing filters are applied in 
stats::arima, the number of used observations (nobs) reported by nobs() is bigger than the actual number of non-NA observations. This value is used to compute the covariance matrix of parameter estimates and by BIC(), so this is relevant for model selection.

Example: the same number of observations are used in both fits below, but nobs() reports different values:

    x <- round(log(AirPassengers),2)
    length(x)
    #[1] 144
    x[c(36,52)] <- NA
    # fit ARIMA(0,1,0) to the original series
    fit1 <- stats::arima(x, order=c(0,1,0), include.mean=FALSE, method="ML") 
    nobs(fit1)
    #[1] 141
    # fit ARIMA(0,0,0) to the differenced series
    fit2 <- stats::arima(diff(x), order=c(0,0,0), include.mean=FALSE, method="ML") 
    nobs(fit2)

    #[1] 139

In "stats::arima", nobs is obtained as:

    n.used <- sum(!is.na(x)) - length(Delta)

where "length(Delta)" is the number of observations missed due to differencing. The original data in "x" are used instead of the differenced data. This ignores the fact that a missing observation generates two NAs in the differenced series:
    diff(c(1,2,3,NA,5,6,7,8))

    #[1]  1  1 NA NA  1  1  1

"stats::arima" counts the NAs in the original data, but this does not always match the number of NAs in the differenced series, which is the series that is used to compute the log-likelihood. Although there is only *one* NA observation, the contribution to the log-likelihood of *two* observations will be missed due to this NA, hence the number of contributions to the likelihood is not "sum(!is.na(x)) - length(Delta)" but "sum(!is.na(diff(x)))".

The more NAs are in the data, the bigger will be the effect of this issue.

If the argument "order[2]=1", "seasonal$order[2]=0" and the NA is the first observation, then my comment above is not an issue. It is not either an issue when "seasonal$order[2]=1" and the NAs are in the first "seasonal$period" observations. But in general, I believe the way nobs is computed requires adjusting.
I would propose (reusing code from stats::arima):

    dx <- x
    if(d > 0L) {
      dx <- diff(dx, 1L, order[2L])
    }

    if(seasonal$period > 1L & seasonal$order[2L] > 0) {
      dx <- diff(dx, seasonal$period, seasonal$order[2L])    }
    n.used <- sum(!is.na(dx))

If the argument "xreg" is NULL, "dx" is not needed and the following may be more efficient than the above option (I checked it with just a few examples, so the above may be a safer option):

    d <- order[2L]
    D <- seasonal$order[2L]

    S <- seasonal$period
    n.used <- length(x) - sum(is.na(x)[seq_len(S)]) -       2*(d + D) * sum(is.na(x)[-seq_len(S)]) - length(Delta)

If "xreg" is not NULL, "dx" is required in other part of the code, so it is convenient to use it here as well; replacing "x" by "dx" and "xreg" by "dxreg":

    isna <- is.na(dx) | apply(dxreg, 1L, anyNA)
    n.used <- sum(!isna)

Are there any reasons for counting the number of available observations in the original series rather than in the differenced series? Should this be fixed?

Thanks

Javier López-de-Lacalle
https://jalobe.com