[R] Moving window regressions - how can I improve this code?

Sat Apr 24 17:52:28 CEST 2004

Douglas Bates produced an extremely useful post so I hesitate to 
add more but here goes anyways. 

1. The bug I referred to in my previous post is still in the code.  
The last data point in women never gets used.  I think this
goes beyond just making a simple error but gets to the point of
how to avoid index arithmetic in the first place due to its
error prone nature and excess complexity.    Using embed()
or running() (the latter is in package gregmisc) would do that although
at the expense of more memory.

2. I find using stopifnot convenient for testing assertions.  There
is also a bug in the assertion code shown (width should be allowed 
to equal nr) and again I think its due to the introduction of complexity.  
The problem is that its harder to get it right when you have to invert 
the assertion condition.  I have recently discovered the R 
stopifnot function which allows one to state assertions without 
inverting them and now use it all the time.   Once you use stopifnot
the assertion becomes somewhat clearer since its not muddied by the
inversion and in my mind it starts to bring out a wider range of
considerations such as whether a width
with zero or fewer degrees of freedom should be allowed.  Perhaps
the condition should be width > length(all.vars(formula)) or if
formulae such as y~x-1 are allowed then a more complex specification.

With these changes (except I've kept width > 0), movingWindow becomes

movingWindow <- function(formula, data, width, ...) {
    nr <- nrow(data)
    width <- as.integer(width)[1]
    stopifnot( width > 0, width <= nr )
    indices <- as.data.frame( t( embed( 1:nr, width ) ) )
    lapply(indices, function(st) summary(lm(formula, data[st,])) )
}

Similar comments could be applied to movingWindow2.

Douglas Bates <bates <at> stat.wisc.edu> writes:

: 
: Ajay Shah <ajayshah <at> mayin.org> writes:
: 
: > I wrote a function which does "moving window" regressions. E.g. if
: > there are 100 observations and the window width is 50, then I first
: > run the regression for observations 1..50, then for 2..51, and so on.
: > 
: > I am extremely pleased with R in my experience with writing this,
: > since I was able to pass the model as an argument into the function
: >  Forgive me if I sound naive, but that's rocket science to me!!
: > 
: > For a regression with K explanatory variables, I make a matrix with
: > 2*K+2 columns, where I capture:
: >     K coefficients and K standard errors
: >     the residual sigma
: >     R^2
: > 
: > My code is:
: > 
: >    # ------------------------------------------------------------ 
: >    movingWindowRegression <- function(data, T, width, model, K) {
: >      results = matrix(nrow=T, ncol=2*K+2)
: >      for (i in width:T) {
: >        details <- summary.lm(lm(as.formula(model), data[(i-width+1):i,]))
: >        n=1;
: >        for (j in 1:K) {
: >          results[i, n]   = details$coefficients[j, 1]
: >          results[i, n+1] = details$coefficients[j, 2]
: >          n = n + 2
: >        }
: >        results[i, n] = details$sigma
: >        results[i, n+1] = details$r.squared
: >      }
: >      return(results)
: >    }
: >    
: >    # Simulate some data for a linear regression
: >    T = 20
: >    x = runif(T); y = 2 + 3*x + rnorm(T);
: >    D = data.frame(x, y)
: >    
: >    r = movingWindowRegression(D, T=T, width=10, model="y ~ x", K=2)
: >    print(r)
: >    # ------------------------------------------------------------ 
: > 
: > I would be very happy if you could look at this and tell me how to do
: > things better.
: 
: First, thanks for posting the code.  It takes courage to send your
: code to the list like this for public commentary.  However, questions
: like this provide instructive examples.
: 
: Some comments:
: 
:  As Gabor pointed out, there is a generic function nrow that can be
:  applied to data frames.
: 
:  As a matter of style, it is better to use
:     details <- summary(lm(model, data[<row range>,]))
:  That is, call the generic function "summary", not the specific name
:  of the method "summary.lm".  It is redundant to use summary.lm(lm(...))
:  and this construction is not guaranteed to continue to be valid.
: 
:  With regard to the looping, I would suggest using a list of summary
:  objects as the basic data structure within your function, then
:  extracting the pieces that you want from that list using lapply or sapply.
: 
:  That is, I would start with 
: 
: movingWindow <- function(formula, data, width, ...) {
:     nr = nrow(data)
:     width = as.integer(width)[1]
:     if (width <= 0 || width >= nr)
:         stop(paste("width must be in the range 1,...,", nr, sep=""))
:     nreg = nr - width
:     base = 0:(width - 1)
:     sumrys <- lapply(1:nreg,
:                      function(st) {
:                          summary(lm(formula, data[base + st,]))
:                      })
:     sumrys
: }
: 
:   I changed the argument name "model" to "formula" and changed the
:   order of the arguments to the standard order used in R modeling
:   functions.
: 
:   You may not want to use this form for very large data sets because
:   the raw summaries could be very large.  However this is a place to
:   start.
: 
:   The reason for creating a list is so you can use sapply and lapply to
:   extract results from the list.  To get the coefficients and standard
:   errors from a summary, use the coef generic and the select columns from
:   the result.  For example,
: 
: > data(women)
: > sumrys = movingWindow(weight ~ height, women, 9)
: > unlist(lapply(sumrys, function(sm) sm$sigma))  # sigma values
: [1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
: > sapply(sumrys, "[[", "sigma")                  # same thing
: [1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
: > sapply(sumrys, function(sm) coef(sm)[,1:2])
:              [,1]         [,2]         [,3]         [,4]        [,5]
: [1,] -59.77777778 -67.12777778 -73.42222222 -81.97222222 -91.7777778
: [2,]   3.00000000   3.11666667   3.21666667   3.35000000   3.5000000
: [3,]   3.77647036   2.64466036   3.70537766   4.71400546   5.1522157
: [4,]   0.06085806   0.04194352   0.05784947   0.07246601   0.0780042
:               [,6]
: [1,] -106.12777778
: [2,]    3.71666667
: [3,]    6.60219186
: [4,]    0.09846709
: > 
: 
:   The last part shows how to get the coefficients estimates and their
:   standard errors as columns of a matrix.  I think I would return the
:   coefficients and standard errors separately, as in
: 
: movingWindow <- function(formula, data, width, ...) {
:     nr = nrow(data)
:     width = as.integer(width)[1]
:     if (width <= 0 || width >= nr)
:         stop(paste("width must be in the range 1,...,", nr, sep=""))
:     nreg = nr - width
:     base = 0:(width - 1)
:     sumrys <- lapply(1:nreg,
:                      function(st) {
:                          summary(lm(formula, data[base + st,]))
:                      })
:     list(coefficients = sapply(sumrys, function(sm) coef(sm)[,"Estimate"]),
:          Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std. Error"]),
:          sigma = sapply(sumrys, "[[", "sigma"),
:          r.squared = sapply(sumrys, "[[", "r.squared"))
: }
: 
: > movingWindow(weight ~ height, women, width = 9)
: $coefficients
:                  [,1]       [,2]       [,3]      [,4]      [,5]        [,6]
: (Intercept) -59.77778 -67.127778 -73.422222 -81.97222 -91.77778 -106.127778
: height        3.00000   3.116667   3.216667   3.35000   3.50000    3.716667
: 
: $Std.Error
:                   [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
: (Intercept) 3.77647036 2.64466036 3.70537766 4.71400546 5.1522157 6.60219186
: height      0.06085806 0.04194352 0.05784947 0.07246601 0.0780042 0.09846709
: 
: $sigma
: [1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
: 
: $r.squared
: [1] 0.9971276 0.9987338 0.9977411 0.9967352 0.9965351 0.9951107
: 
:   The next enhancements come from realizing that you do not need
:   to call lm repeatedly.  The lm function involves many steps working
:   with the formula and the data frame and optional arguments to
:   produce a model matrix and response vector.   You only need to do
:   that once.  After that you can call lm.fit on subsets of the rows.
: 
:   If you arrange that the call to lm is based on the "matched call" to
:   your function then you can get the ability to work with standard
:   modeling arguments such as subset and na.action for free.  This may
:   seem like an obscure point but there are big gains in having your
:   modeling function behave like all the other R modeling functions.
:   Thomas Lumley's article on "Standard non-standard evaluation in R"
:   (which I would say was available at developer.r-project.org except
:   that the developer.r-project.org site is still down) explains this
:   in more detail.
: 
:   Furthermore, by doing the initial lm fit you find out how many
:   coefficients there are in the model and can do a better job of
:   checking for a sensible "width" argument.
: 
:   There are subtleties in this version of movingWindow2 involving
:   manipulation of the arguments in the original call to lm but these
:   are explained in the manual page for lm.  You do need to set the
:   class and the terms component in the result of lm.fit before you can
:   apply summary to it.
: 
: movingWindow2 <- function(formula, data, width, ...) {
:     mCall = match.call()
:     mCall$width = NULL
:     mCall[[1]] = as.name("lm")
:     mCall$x = mCall$y = TRUE
:     bigfit = eval(mCall, parent.frame())
:     ncoef = length(coef(bigfit))
:     nr = nrow(data)
:     width = as.integer(width)[1]
:     if (width <= ncoef || width >= nr)
:         stop(paste("width must be in the range ", ncoef + 1,
:                    ",...,", nr - 1, sep=""))
:     y = bigfit$y
:     x = bigfit$x
:     terms = bigfit$terms
:     base = 0:(width - 1)
:     sumrys <- lapply(1:(nr - width),
:                      function(st) {
:                          inds = base + st
:                          fit = lm.fit(x[inds,], y[inds])
:                          fit$terms = terms
:                          class(fit) = "lm"
:                          summary(fit)
:                      })
:     list(coefficients = sapply(sumrys, function(sm) coef(sm)[,"Estimate"]),
:          Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std. Error"]),
:          sigma = sapply(sumrys, "[[", "sigma"),
:          r.squared = sapply(sumrys, "[[", "r.squared"))
: }
: 
: > movingWindow2(weight ~ height, women, 9)
: $coefficients
:                  [,1]       [,2]       [,3]      [,4]      [,5]        [,6]
: (Intercept) -59.77778 -67.127778 -73.422222 -81.97222 -91.77778 -106.127778
: height        3.00000   3.116667   3.216667   3.35000   3.50000    3.716667
: 
: $Std.Error
:                   [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
: (Intercept) 3.77647036 2.64466036 3.70537766 4.71400546 5.1522157 6.60219186
: height      0.06085806 0.04194352 0.05784947 0.07246601 0.0780042 0.09846709
: 
: $sigma
: [1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
: 
: $r.squared
: [1] 0.9971276 0.9987338 0.9977411 0.9967352 0.9965351 0.9951107
: 
:   The use of lapply and sapply on lists is a style of programming
:   called "functional programming".  The functional programming
:   facilities in the S language are not as widely recognized and
:   appreciated as they should be.  Phil Spector's book "An Introduction
:   to S and S-PLUS" is one place where these are discussed and
:   illustrated.
: 
: P.S. If building a list of summaries is taking too much memory then
: replace summary(fit) by summary(fit)[c("coefficients", "sigma", "r.squared")]
: 
: ______________________________________________
: R-help <at> stat.math.ethz.ch mailing list
: https://www.stat.math.ethz.ch/mailman/listinfo/r-help
: PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
: 
: