[R] foreach loop, stata equivalent

Mon Feb 18 17:50:04 CET 2013

If you have the same set of predictors for multiple outcomes, you can
speed up the process considerably by taking advantage of the fact that
the design matrix is the same for multiple outcomes.  As an example:

set.seed(10)
y <- matrix(rnorm(10000 * 14), ncol = 14)
x <- matrix(rnorm(10000 * 2), ncol = 2)
system.time(res <- lapply(1:14, function(i) lm(y[, i] ~ x)))
##   user  system elapsed
##   0.34    0.00    0.34
system.time(res2 <- lm(y ~ x))
##   user  system elapsed
##   0.05    0.02    0.06

lm can accept a matrix as the dependent variable.  So if various
combinations of variables predict all 14 outcomes, do not fit 14 *
number of combinations of predictors separately, do them in chunks for
substantial speedups.

Finally, as long as you are not using factors or any other fancy
things, and can work with just raw data matrices, instead of using
lm(), which is a high level function, use lm.fit().  It is not
especially clever, just expects the design matrix and response matrix.
 It will not an intercept by default, so to your data column bind on a
vector of 1s.

system.time(res3 <- lm.fit(cbind(1, x), y))
##   user  system elapsed
##   0.02    0.00    0.01

These three methods produce identical results:

res[[1]]
## Call:
## lm(formula = y[, i] ~ x)
## Coefficients:
## (Intercept)           x1           x2
##  0.0014401   -0.0198232   -0.0005721
res2[[1]][, 1]
##   (Intercept)            x1            x2
##  0.0014401149 -0.0198232209 -0.0005720764
res3[[1]][, 1]
##            x1            x2            x3
##  0.0014401149 -0.0198232209 -0.0005720764

however, fit each response one at a time instead of taking advantage
of fitting multiple responses at once (so the design matrix can be
reused) and taking advantage of lower level functions when you have a
simple, specific, repetitive task takes the estimation time from .34
down to .01.  You would need 34 cores to achieve that by simply
throwing more hardware at the problem as opposed to using more
optimized code.  Of course you can do both, and probably get the
results pretty quickly.

Cheers,

Josh

On Mon, Feb 18, 2013 at 8:25 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> Le lundi 18 février 2013 à 17:09 +0100, Jamora, Nelissa a écrit :
>> Hi Milan,
>>
>> Thanks for responding to my question. I'm actually not interested in LM, it
>> was more  for example.
>>
>> You are right, I'm trying an enormous set of model runs. My Var1 n=14; Var2
>> n=255 ==> 3570!
>> But first, I need be able to set up 2 variables in each model run. Those 2
>> variables are different in each case. I can set this up 1-by-1 but it will be
>> tedious and not efficient.
>>
>> To describe in more details
>> I have a data frame with 269 variables.
>> 1. individual columns 1-14 can be my first variable
>> 2. individual columns 15-269 can be my second variable.
>>
>> Variable1 and variable2 are different in each case. For e.g.
>> Model 1: var1 and var15
>> Model 2: var1 and var16
>> Model 3: var1 and var17...
>> ....
>> Model 3570: var14 and var269
>> So I need to write a loop command that calls for different sets of variable1
>> and variable2 in each run.
>>
>> What do I intend to do with this? I'm running threshold vecm (package tsDyn),
>> and I need to summarize threshold estimates in each model run (or market
>> pairs, var1 and var2). The goal is to extract N=3,570 threshold estimates.
>> I did similar linear VECM estimates in Stata using my foreach loop, but now I
>> need to make parallel run in R but using threshold model.
> Ah, if you need parallelism, you can likely try something like this
> (untested) :
>
> # Create cluster cl before and export yourData to them
> parLapply(cl, paste0("p", 1:14)), function(var) {
>     lapply(paste0("p", 15:269), function(y) {
>          lm(yourData[[var]] ~ yourData[[y]])
>     })
> }
>
> This will only be optimal if you have less than 14 cores.
>
>
> Regards
>
>
>> Hope this clears things.
>> Nelissa
>>
>>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Milan Bouchet-Valat [mailto:nalimilan at club.fr]
>> Sent: Monday, February 18, 2013 3:44 PM
>> To: Jamora, Nelissa
>> Cc: r-help at r-project.org
>> Subject: Re: [R] foreach loop, stata equivalent
>>
>> Le lundi 18 février 2013 à 13:48 +0100, Jamora, Nelissa a écrit :
>> > Hi! I'm a recent convert from Stata, so forgive my ignorance.
>> >
>> >
>> >
>> > In Stata, I can write foreach loops (example below)
>> >
>> >
>> >
>> > foreach var of varlist p1-p14 {
>> >
>> > foreach y of varlist p15-p269 {
>> >
>> >   reg `var' `y'
>> >
>> > }
>> >
>> > }
>> >
>> >
>> >
>> > It's looping p1-p15, p1-p16...., p1-p269, p2-p15, p2-p16,... p2-p269,...
>> > variable pairs.
>> >
>> >
>> >
>> > How can I write something similar in R?
>> >
>> > I 'tried' understanding the package.foreach but can't get it to work.
>> You do not need package foreach, which is intended at a completely different
>> problem.
>>
>> R does not really have the syntactic equivalent of "varlist", but you can
>> easily do something like:
>> for(var in paste0("p", 1:14)) {
>>     for(y in paste0("p", 15:269))
>>         lm(yourData[[var]] ~ yourData[[y]]) }
>>
>> provided that yourData is the data frame in which the p* variables are
>> stored.
>>
>> There are probably more direct ways of doing the same thing and storing the
>> resulting lm objects in a list, but you did not state what you intend to do
>> with this enormous set of regressions...
>>
>>
>> Regards
>>
>> > Thanks for any help
>> >
>> > Nelissa
>> >
>> >
>> >     [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/