[Rd] Fix for bug in arima function
Martin Maechler
maechler at lynne.stat.math.ethz.ch
Thu May 21 12:49:09 CEST 2015
>>>>> peter dalgaard <pdalgd at gmail.com>
>>>>> on Thu, 21 May 2015 11:03:05 +0200 writes:
> On 21 May 2015, at 10:35 , Martin Maechler <maechler at lynne.stat.math.ethz.ch> wrote:
>>>
>>> I noticed that the 3.2.1 release cycle is about to start. Is there any
>>> chance that this fix will make it into the next version of R?
>>>
>>> This bug is fairly serious: getting the wrong variance estimate leads to
>>> the wrong log-likelihood and the wrong AIC, BIC etc, which can and does
>>> lead to suboptimal model selection. If it's not fixed, this issue will
>>> affect every student taking our time series course in Fall 2015 (and
>>> probably lots of other students in other time series courses). When I
>>> taught time series in Spring 2015, I had to teach students how to work
>>> around the bug, which wasted class time and shook student confidence in R.
>>> It'd be great if we didn’t have to deal with this issue next semester.
>>>
>>> Again, the fix is trivial:
>>>
>>> --- a/src/library/stats/R/arima.R
>>> +++ b/src/library/stats/R/arima.R
>>> @@ -211,8 +211,10 @@ arima <- function(x, order = c(0L, 0L, 0L),
>>>          if(fit$rank == 0L) {
>>>              ## Degenerate model. Proceed anyway so as not to break old code
>>>              fit <- lm(x ~ xreg - 1, na.action = na.omit)
>>> +            n.used <- sum(!is.na(resid(fit))) - length(Delta)
>>> +        } else {
>>> +            n.used <- sum(!is.na(resid(fit)))
>>>          }
>>> -        n.used <- sum(!is.na(resid(fit))) - length(Delta)
>>>          init0 <- c(init0, coef(fit))
>>>          ses <- summary(fit)$coefficients[, 2L]
>>>          parscale <- c(parscale, 10 * ses)
>>>
>>
>> Yes, such a change *is* small in the source code.
>> But we have to be sure about its desirability.
>>
>> In another post about this you mention "REML", and I think we
>> really are discussing if variance estimates should use a
>> denominator of 'n' or 'n - p' in this case.
>>
>>
>>> The patch that introduced the bug (
>>> https://github.com/wch/r-source/commit/32f633885a903bc422537dc426644f743cc645e0
>>> ) was designed to change the initialization for the optimization routine.
>>
>>> The proposed fix leaves the deliberate part of the patch unchanged (it
>>> preserves the value of "init0").
>>
>> I can confirm this... a change introduced in R 3.0.2.
>>
>> I'm about to commit changes ... after also adding a proper
>> regression test.
>>
> Be careful here! I was just about to say that the diagnosis is dubious, and that the patch could very well be wrong!!
> AFAICT, the issue is that n.used got changed from being based on lm(x~...) to lm(dx~...) where dx is the differenced series. Now that surely loses one observation in arima(.,1,.), most likely unintentionally, but it is not at all clear that the fix is not to subtract length(Delta) -- that code has been there long before the changes in 3.0.2.
Well... yes, but as you say, that was for the case of the original
lm() fit, where the resulting residuals, and hence is.na(resid(.)),
were longer ...
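
To make the counting concrete, here is a minimal sketch on simulated data. It is not the code in arima.R, and the names n, d, fit_orig and fit_diff are made up for illustration. It only shows that the residuals of a regression on the differenced series are already d shorter than those of a regression on the original series:

```r
## Illustrative sketch only: simulated data, hypothetical variable names.
set.seed(42)
n <- 100L
d <- 1L                                   # order of differencing, as in arima(., 1, .)
xreg <- cbind(z = cumsum(rnorm(n)))       # one external regressor
x <- 0.5 * xreg[, "z"] + cumsum(rnorm(n))

## Difference the response and the regressor
dx    <- diff(x,    1L, d)
dxreg <- diff(xreg, 1L, d)

fit_orig <- lm(x  ~ xreg  - 1, na.action = na.omit)
fit_diff <- lm(dx ~ dxreg - 1, na.action = na.omit)

n_orig <- sum(!is.na(resid(fit_orig)))    # counts all n observations
n_diff <- sum(!is.na(resid(fit_diff)))    # differencing already dropped d of them

c(original = n_orig, differenced = n_diff,
  after_second_subtraction = n_diff - d)
```

Subtracting d from n_orig gives the same count as n_diff; subtracting it from n_diff removes the differenced observations a second time.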
> I'd expect that a safer fix would be to add back the orders of the two differencing operations.
What I did check before replying is that the patch *does* revert to 'R <= 3.0.1'
behavior for simple 'xreg' cases.
I do see changes in the S.Es of the regression coefficients, as
they are expected.
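
For anyone who wants to run this kind of check themselves, a rough sketch follows. The data are simulated and the variable names are hypothetical; it is not the regression test mentioned above. The idea is to run the same script under different R versions and compare the reported number of used observations, the variance estimate and the standard errors:

```r
## Illustrative sketch only: simulated data, hypothetical variable names.
set.seed(1)
n <- 100L
d <- 1L
z <- cumsum(rnorm(n))                                      # external regressor
e <- cumsum(as.numeric(arima.sim(list(ar = 0.5), n = n)))  # integrated AR(1) noise
x <- 2 * z + e

fit <- arima(x, order = c(1L, d, 0L), xreg = cbind(z = z), method = "ML")

## 'nobs' is the number of observations arima() reports having used;
## compare it with the count expected after d differences.
c(nobs = fit$nobs, n_minus_d = n - d)
fit$sigma2
sqrt(diag(fit$var.coef))                                   # S.E.s of ar1 and z
```

The printed values depend on the R version used, which is the point of the comparison.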
The few cases I've looked at were all giving results compatible
with R <= 3.0.1 (or triggered the bug that was fixed in R 3.0.2),
but I'd be happy to see other examples where the
degrees of freedom should be computed differently, e.g., by
taking into account the differencing orders as you suggest.
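
As a purely arithmetical illustration of why the count matters (this is not how arima() computes sigma2 internally, and rss, n and d are made-up values), the same residual sum of squares gives a different variance estimate depending on how many observations are treated as used:

```r
## Illustrative arithmetic only; 'rss', 'n' and 'd' are made-up values.
rss <- 250      # a fixed residual sum of squares
n   <- 100L     # length of the original series
d   <- 1L       # order of differencing

sigma2_once  <- rss / (n - d)        # differencing accounted for once
sigma2_twice <- rss / (n - 2L * d)   # differencing subtracted a second time

c(once = sigma2_once, twice = sigma2_twice,
  ratio = sigma2_twice / sigma2_once)
```

The ratio is (n - d) / (n - 2d), so the discrepancy is small for long series and more noticeable for short ones or for seasonal differencing with a long period.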
Seeing how relatively easy it still is to get the internal call
to optim() to produce an error, I do wonder if there are such
extensively tested arima(*, xreg = .) examples.
If we do not get more suggestions here, I'd like to commit to
R-devel only. This would still not mean that this is going to
be in R 3.2.1 ... though it would be nice if others confirmed or
helped with more references.
Martin