[R] Operating on windows of data

Gabor Grothendieck ggrothendieck at myway.com
Wed Mar 24 06:50:05 CET 2004


[I am having some problems with returned email.  Sorry if
you get this twice.]

Well so much for pontificating. My "shining example" has an 
error too although this error is less serious since it will
give an error message. embed(T,width) should be embed(1:T,width).

But perhaps there is more to this than we realize since the
subsequent (space inefficient) solution which I thought might 
be taking it too far does not require calculation of T except
in embed.data.frame which is a reusable function that would
likely be of higher quality through reuse and in any case the
fact is that it did avoid this error.


Date:   Wed, 24 Mar 2004 00:09:19 -0500 (EST) 
From:   Gabor Grothendieck <ggrothendieck at myway.com>
To:   <ok at cs.otago.ac.nz>, <r-help at stat.math.ethz.ch> 
Cc:   <ajayshah at mayin.org> 
Subject:   Re: [R] Operating on windows of data 

 


Far from being an example where vectorization
seems to have only minor advantages, if any, over a 
non-vectorized approach, it turns out that this is a shining 
example of why vectorization produces higher quality code.

Consider that the loop approach and all the other
solutions so far have an error! The last time point is
never included in any window.

The genesis of the error is that index arithmetic for
producing the loop limits and sapply index is sufficiently
complex that no one noticed the error. Vectorizing
*sufficiently* eliminates the calculation of these limits
entirely and so avoids the possibility of mistake. The
problem so far is that we just did not take the vectorized
approach far enough (mea culpa).

The following solution uses embed (which can be regarded as
a vectorized sliding window) to get us out of considering such
explicit limits. By eliminating the calculation of these limits
it avoids any potential for this sort of error in the first place.
(It also avoids the transpose by producing data frames at
each step and rbinding them.)


do.call( "rbind", apply( embed(T, width), 1, 
function(idx) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
           data = A, subset = idx)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], 
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} ) 
)

Note that the above example loops over vectors of indices
instead of single indexes. There is an opportunity to do away 
with indices altogether although it is at the considerable
expense of space efficiency so, in practice, one would likely 
be content with the last solution. Nevertheless, it is of 
interest to display the next solution even if only for sake of completing the thought.

Define embed.data.frame to produce a list of data frames that
represents the sliding windows. Perform a lapply over those:

embed.data.frame <- function(df,width) 
apply(embed(1:nrow(df),width),1,function(idx)df[idx,])

do.call( "rbind", lapply(embed.data.frame(df,width), 
function(df) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], 
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} )
)

This gets rid of all reference to idx and nrow(df).


---

Date: Wed, 24 Mar 2004 14:56:46 +1200 (NZST) 
From: Richard A. O'Keefe <ok at cs.otago.ac.nz>
To: <r-help at stat.math.ethz.ch> 
Subject: Re: [R] Operating on windows of data 

[snip]

Hmm. My "first" and "second" ways are both the same: use names rather
than position. There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.

# dfapply(X, FUN, ...) is like sapply() but
# expects FUN to return c(x1=...,xn=...) vectors which it
# turns into rows of the data frame that it returns.

dfapply <- function (...) as.data.frame(t(sapply(...)))

# Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
# and rows 1:T-width, the ith row extracted from a linear
# regression on cases i:(i+width-1).

dat <- dfapply(seq(T-width), function (i) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
data = A, subset = i:(i+width-1))
s <- summary.lm(model)
v <- coefficients(model)
c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"],
R2 = s$r.squared, RMSE = s$sigma)
})

Now here's where using sapply() instead of 'for' does pay off, even here.
We ask the question "where is 'i' used?" Because we're *not* using i in
any visible index calculations, there is only one place that 'i' is used,
and that's in the subset= argument of the lm() call.

That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same? Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing. But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.




More information about the R-help mailing list