[R] C/C++/Fortran Rolling Window Regressions

Thu Jul 21 22:56:17 CEST 2016

I appreciate the timing, so much so I changed the code to show the issue.
 It is a problem of scale.

 roll_lm probably has a heavy start-up cost but otherwise completely
out-performs those other versions at scale.  I suspect you are timing the
nearly  constant time start-up cost in small data.  I did give code to
paint a picture, but it was just cartoon code lifted from stackexchange.
If you want to characterize the real problem it is closer to:
30 day rolling windows on 24 daily (by hour) measurements for 5 years with
24+7 -1 dummy predictor variables and finally you need to do this for 300
sets of data.

Pseudo-code is closer to what follows and roll_lm can handle that input in
a timely manner.  You can do it with lm.fit, but you need to spend a lot of
time waiting.  The issue of accuracy needs a follow-up check.  Not sure why
it would be different.  Worth a check on that.

Thanks,
Jeremiah

library(rbenchmark)
N = 30*24*12*5
window = 30*24
npred = 15  #15 chosen arbitrarily...
set.seed(1)
library(zoo)
library(RcppArmadillo)
library(roll)
x = matrix(rnorm(N*(npred+1)), ncol = npred+1)
colnames(x) <- c("y",  paste0("x", 1:npred))
z <- zoo(x)

benchmark(
   roll_lm =  roll_lm(coredata(z[, 1, drop = F]), coredata(z[, -1, drop =
F]), window,
     center = FALSE), replications=3)

Which comes out as:
     test replications elapsed relative user.self sys.self user.child
sys.child
1 roll_lm            3   6.273        1    38.312    0.654          0
  0

## You arn't going to get that below...

benchmark(fastLm = rollapplyr(z, width = window,
     function(x) coef(fastLm(cbind(1, x[, -1]), x[, 1])),
     by.column = FALSE),
   lm = rollapplyr(z, width = window,
     function(x) coef(lm(y ~ ., data = as.data.frame(x))),
     by.column = FALSE), replications=3)

On Thu, Jul 21, 2016 at 1:28 PM, Gabor Grothendieck <ggrothendieck at gmail.com
> wrote:

> I would be careful about making assumptions regarding what is faster.
> Performance tends to be nonintuitive.
>
> When I ran rollapply/lm, rollapply/fastLm and roll_lm on the example
> you provided rollapply/fastLm was three times faster than roll_lm.  Of
> course this could change with data of different dimensions but it
> would be worthwhile to do actual benchmarks before making assumptions.
>
> I also noticed that roll_lm did not give the same coefficients as the
> other two.
>
> set.seed(1)
> library(zoo)
> library(RcppArmadillo)
> library(roll)
> z <- zoo(matrix(rnorm(10), ncol = 2))
> colnames(z) <- c("y", "x")
>
> ## rolling regression of width 4
> library(rbenchmark)
> benchmark(fastLm = rollapplyr(z, width = 4,
>      function(x) coef(fastLm(cbind(1, x[, 2]), x[, 1])),
>      by.column = FALSE),
>    lm = rollapplyr(z, width = 4,
>      function(x) coef(lm(y ~ x, data = as.data.frame(x))),
>      by.column = FALSE),
>    roll_lm =  roll_lm(coredata(z[, 1, drop = F]), coredata(z[, 2, drop =
> F]), 4,
>      center = FALSE))[1:4]
>
>
>      test replications elapsed relative
> 1  fastLm          100    0.22    1.000
> 2      lm          100    0.72    3.273
> 3 roll_lm          100    0.64    2.909
>
> On Thu, Jul 21, 2016 at 3:45 PM, jeremiah rounds
> <roundsjeremiah at gmail.com> wrote:
> >  Thanks all.  roll::roll_lm was essentially what I wanted.   I think
> maybe
> > I would prefer it to have options to return a few more things, but it is
> > the coefficients, and the remaining statistics you might want can be
> > calculated fast enough from there.
> >
> >
> > On Thu, Jul 21, 2016 at 12:36 PM, Achim Zeileis <
> Achim.Zeileis at uibk.ac.at>
> > wrote:
> >
> >> Jeremiah,
> >>
> >> for this purpose there are the "roll" and "RcppRoll" packages. Both use
> >> Rcpp and the former also provides rolling lm models. The latter has a
> >> generic interface that let's you define your own function.
> >>
> >> One thing to pay attention to, though, is the numerical reliability.
> >> Especially on large time series with relatively short windows there is a
> >> good chance of encountering numerically challenging situations. The QR
> >> decomposition used by lm is fairly robust while other more
> straightforward
> >> matrix multiplications may not be. This should be kept in mind when
> writing
> >> your own Rcpp code for plugging it into RcppRoll.
> >>
> >> But I haven't check what the roll package does and how reliable that
> is...
> >>
> >> hth,
> >> Z
> >>
> >>
> >> On Thu, 21 Jul 2016, jeremiah rounds wrote:
> >>
> >> Hi,
> >>>
> >>> A not unusual task is performing a multiple regression in a rolling
> window
> >>> on a time-series.    A standard piece of advice for doing in R is
> >>> something
> >>> like the code that follows at the end of the email.  I am currently
> using
> >>> an "embed" variant of that code and that piece of advice is out there
> too.
> >>>
> >>> But, it occurs to me that for such an easily specified matrix operation
> >>> standard R code is really slow.   rollapply constantly returns to R
> >>> interpreter at each window step for a new lm.   All lm is at its heart
> is
> >>> (X^t X)^(-1) * Xy,  and if you think about doing that with Rcpp in
> rolling
> >>> window you are just incrementing a counter and peeling off rows (or
> >>> columns
> >>> of X and y) of a particular window size, and following that up with
> some
> >>> matrix multiplication in a loop.   The psuedo-code for that Rcpp
> >>> practically writes itself and you might want a wrapper of something
> like:
> >>> rolling_lm (y=y, x=x, width=4).
> >>>
> >>> My question is this: has any of the thousands of R packages out there
> >>> published anything like that.  Rolling window multiple regressions that
> >>> stay in C/C++ until the rolling window completes?  No sense and
> writing it
> >>> if it exist.
> >>>
> >>>
> >>> Thanks,
> >>> Jeremiah
> >>>
> >>> Standard (slow) advice for "rolling window regression" follows:
> >>>
> >>>
> >>> set.seed(1)
> >>> z <- zoo(matrix(rnorm(10), ncol = 2))
> >>> colnames(z) <- c("y", "x")
> >>>
> >>> ## rolling regression of width 4
> >>> rollapply(z, width = 4,
> >>>   function(x) coef(lm(y ~ x, data = as.data.frame(x))),
> >>>   by.column = FALSE, align = "right")
> >>>
> >>> ## result is identical to
> >>> coef(lm(y ~ x, data = z[1:4,]))
> >>> coef(lm(y ~ x, data = z[2:5,]))
> >>>
> >>>         [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>

	[[alternative HTML version deleted]]