[R] Loop: noob question

William Dunlap wdunlap at tibco.com
Fri Aug 5 20:20:57 CEST 2011


I don't think the "use replicate" answer is necessarily "wrong".
Making a matrix of all the random numbers you plan on using
and then using apply on it requires that you have
memory for all those random numbers.  Generating
a batch of random numbers, running quantile on them,
then discarding the random numbers (saving only
the output of quantile) will save quite a bit of memory
and lets you increase the number of repetitions more
than the first method allows.  E.g., here are three ways
of doing the same simulation with 1000 reps:

 > set.seed(1) ; t0 <- system.time(z0 <- replicate(10^3, quantile(rnorm(10^4), c(0.25, 0.75))))
 > set.seed(1) ; t1 <- system.time(z1 <- vapply(seq_len(10^3), function(i)quantile(rnorm(10^4), c(0.25, 0.75)), FUN.VALUE=numeric(2)))
 > set.seed(1) ; t2 <- system.time(z2 <- apply(matrix(rnorm(10^4 * 10^3), ncol=10^3), 2, function(col)quantile(col, c(0.25,0.75))))
 > identical(z0,z1)
 [1] TRUE
 > identical(z0,z2)
 [1] TRUE
 > rbind(t0, t1, t2)
    user.self sys.self elapsed user.child sys.child
 t0      2.53     0.01    3.92         NA        NA
 t1      2.33     0.00    3.22         NA        NA
 t2      2.94     0.10    3.85         NA        NA

Their times are not much different.  However, the matrix
approach fails on my 32-bit Windows machine when you ask
for 10000 repetitions:

 > set.seed(1) ; t0 <- system.time(z0 <- replicate(10^4, quantile(rnorm(10^4), c(0.25, 0.75))))
 > set.seed(1) ; t1 <- system.time(z1 <- vapply(seq_len(10^4), function(i)quantile(rnorm(10^4), c(0.25, 0.75)), FUN.VALUE=numeric(2)))
 > set.seed(1) ; t2 <- system.time(z2 <- apply(matrix(rnorm(10^4 * 10^4), ncol=10^4), 2, function(col)quantile(col, c(0.25,0.75))))
 Error: cannot allocate vector of size 762.9 Mb
 In addition: Warning messages:
 1: In matrix(rnorm(10^4 * 10^4), ncol = 10^4) :
   Reached total allocation of 1535Mb: see help(memory.size)
 2: In matrix(rnorm(10^4 * 10^4), ncol = 10^4) :
   Reached total allocation of 1535Mb: see help(memory.size)
 3: In matrix(rnorm(10^4 * 10^4), ncol = 10^4) :
   Reached total allocation of 1535Mb: see help(memory.size)
 4: In matrix(rnorm(10^4 * 10^4), ncol = 10^4) :
   Reached total allocation of 1535Mb: see help(memory.size)
 Timing stopped at: 18.25 0.3 21.8 
 > identical(z0,z1)
 [1] TRUE
 > rbind(t0, t1)
    user.self sys.self elapsed user.child sys.child
 t0     23.95     0.02   30.52         NA        NA
 t1     25.90     0.01   30.75         NA        NA

(I like vapply() because it throws an error if your function
returns something other than what you said it would.  It also
gives a reasonable result when the input has length zero.)

This list gets a lot of questions on the most efficient ways
of doing things.  The answer often depends on the size, shape, or
other nature of the data (or machine) you are working on.

I see a lot of requests for loopless solutions.  Here is one
for the above problem, but it takes about 5 times as long to run
as the above three loopy solutions.
  > set.seed(1) ; t3 <- system.time({ tmp <- matrix(rnorm(10^4 * 10^3), ncol=10^3);
                                      tmp[] <- tmp[order(col(tmp), tmp)] ;
                                      z3 <- tmp[round(c(1/4,3/4)*nrow(tmp)),]})

The bottom line is that if you are concerned with efficiency, you
ought learn how to measure it and get used to running tests.  Seeing
how the run time scales with various measures of your data is important.
system.time() is a good tool in base R and the rbenchmark package
makes it easy to compare various approaches.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> Sent: Friday, August 05, 2011 9:51 AM
> To: R. Michael Weylandt
> Cc: r-help at r-project.org; UnitRoot
> Subject: Re: [R] Loop: noob question
> 
> Michael:
> 
> I'm sorry, but this "advice" is wrong. replicate() **IS** essentially
> a loop: it uses sapply(), which is basically an interpreted loop (with 
> suitable caveats that R experts can provide).
> 
> The correct advice is: whenever possible, move the loops down to
> underlying C code by vectorizing. In the example below, one can partly
> do this by removing the random number generation from the loop and
> structuring the result as a matrix:
> 
> rmat <- matrix(rnorm(2.5e7, nrow=250))  ## each column is a sample of 250
> 
> Now one must use apply() -- which is a loop for -- quantile()
> 
> out <- apply(rmat,2,quantile, probs = .95)
> 
> Unfortunately, in this case, it won't make much difference, as random
> number generation is very fast anyway and the looping for quantile()
> is the same either way. In fact, both versions took almost the same
> time on my computer (replicate() was actually 3 seconds faster --- 48
> vs 51 seconds -- perhaps because sapply() is slightly more efficient
> than apply() ).
> 
> As here (I think), one often cannot vectorize and must loop in
> interpreted code, and for this replicate() is fine and yields nice
> clean code. But it's still a loop.
> 
> Cheers,
> Bert
> 
> On Fri, Aug 5, 2011 at 9:15 AM, R. Michael Weylandt
> <michael.weylandt at gmail.com> wrote:
> > This is a textbook of when NOT to use a loop in R: rather make a function
> > that does what you want and use the replicate function to do it repeatedly.
> >
> > f <- function(){
> > return(-1000*quantile(rnorm(250,0,0.2),0.95)
> > }
> >
> > x = replicate(1e5,f())
> >
> > There are your desired numbers.
> >
> > Some general coding principles: firstly, you don't need to convert to time
> > series: quantile immediately undoes that so you've just wasted the time
> > doing the coercions each direction. Secondly, the quantiles of c*X are
> > exactly c times the quantiles of X so you can cut down on the
> > multiplications needed by doing the -1000 multiplication after
> > quantilization.
> >
> > Specific to R: don't use loops unless entirely necessary. (And it's hardly
> > ever necessary) -- replicate is great for repeated operations, apply is
> > great for looping "over" thins.
> >
> > More broadly, what do you intend to do with the v95 values? There are
> > probably much more efficient ways to get the desired numbers or even closed
> > form results. I believe this idea is widely studied as VaR in finance.
> >
> > Feel free to write back if we can be of more help.
> >
> > Hope this helps,
> >
> > Michael Weylandt
> >
> > On Fri, Aug 5, 2011 at 11:35 AM, UnitRoot <akhussanov at gmail.com> wrote:
> >
> >> Hi,
> >> Can someone help me out to create a (for?) loop for the following
> >> procedure:
> >>
> >> x=rnorm(250,0,0.02)
> >> library(timeSeries)
> >> x=timeSeries(x)
> >> P=1000
> >> loss=-P*x
> >> loss
> >> v95=quantile(loss,0.95)
> >> v95
> >>
> >> I would like to generate 100 000 v95 values.
> >> Also, can anybody name a source where I could read up basic programming in
> >> R?
> >> Thank you.
> >>
> >> --
> >> View this message in context:
> >> http://r.789695.n4.nabble.com/Loop-noob-question-tp3721500p3721500.html
> >> Sent from the R help mailing list archive at Nabble.com.
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >        [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 
> 
> 
> --
> "Men by nature long to get on to the ultimate truths, and will often
> be impatient with elementary studies or fight shy of them. If it were
> possible to reach the ultimate truths without the elementary studies
> usually prefixed to them, these would not be preparatory studies but
> superfluous diversions."
> 
> -- Maimonides (1135-1204)
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list