[R] R Optimization 101 for R Newcomers: An extended example

Berton Gunter gunter.berton at gene.com
Wed May 19 03:20:06 CEST 2004


To new R-Programmers:

This long-winded note gives an example of optimizing in R that should
only be of interest to newcomers to the language; others should ignore
it. My hope is that it might help illuminate some basic notions of code
improvement, looping, and vectorization in R. However, I welcome any
comments that enlarge, correct, or improve it. I remain a novice R
programmer.

For the curious, the hardware/opsys details are a 1.5 GHz Windows 2000
machine running R 1.9.0.

The task is this: I have a data set, mydat, consisting of values >=0,
and a fixed constant, K, that is "much" larger than any of the values. I
want to simulate the distribution of the number of bootstrap draws
(random sampling with replacement) from mydat that is needed  to make
the sum of the values drawn just exceed K. That is, repeatedly bootstrap
sample from mydat until the sum of the results exceeds K. Suppose this
requires 7 draws. Then 7 is a random sample of size 1 from the
distribution I'm interested in. I want to repeat this procedure a
"large" number, nsim, of times to estimate the distribution.

Before reading on, I invite R novices to consider how -- or better yet
write code -- to do this.

Approach 1 (Naive/ straightforward): Perform the simulation exactly as
described by using a nested loop. The outer loop goes from 1 to nsim to
record the number of draws needed. The inner loop repeatedly draws a
sample of size 1 from mydat and accumulates and checks the sum until it
exceeds K, returning the number of times it took to do this to the outer
loop.

This is easy to program up using a "while" loop for the inner loop, and
it takes about a short look's time out of my window (a few seconds) to
run for modest nsim (5-10K). But surely one can do better...

Approach 2: Essentially the same, except that all the sampling is done
"at once." That is, each execution of the inner loop in (1) required R
to call the "sample()" function to get a bootstrap sample of size 1.
This is very inefficient, because there is a high overhead of making
that function call at each iteration. It can be avoided by calling
sample() only once -- or at most a small number of times -- to get a
large sample and then using indexing to work one's way through the large
sample. This turns out to be a little more complicated than approach 1,
because you first have to figure out how large a "large" sample should
be and then have to be a little careful about setting up your indexing. But
it's not very difficult (especially for any C programmer who can
manipulate pointers), and random sampling is so fast that it's no
problem to generate a sample WAYYY bigger than you need (5 or 10 times
as large, even) just to be safe. Or generate a not so big version and
just check at each outer loop iteration whether you need to generate
some more.
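A sketch of Approach 2 under the same illustrative mydat, K, and nsim
as before; the "generate some more" check is the variant described in
the last sentence above:

```r
# Approach 2 (sketch): draw one big bootstrap sample up front, then
# walk through it with an index instead of calling sample() per draw.
set.seed(1)
mydat <- runif(100)   # example data, values >= 0
K <- 20
nsim <- 1000

bigN <- ceiling(5 * nsim * K / mean(mydat))   # WAYYY more than we expect to need
bigsamp <- sample(mydat, bigN, replace = TRUE)

result <- numeric(nsim)
pos <- 1                              # index ("pointer") into bigsamp
for (i in 1:nsim) {
  total <- 0
  ndraws <- 0
  while (total <= K) {
    if (pos > bigN) {                 # ran dry: generate some more
      bigsamp <- sample(mydat, bigN, replace = TRUE)
      pos <- 1
    }
    total <- total + bigsamp[pos]
    pos <- pos + 1
    ndraws <- ndraws + 1
  }
  result[i] <- ndraws
}
```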

Doing it this way -- now using indexing, not function calls, in the
inner loop -- made a considerable improvement (5- to 10-fold). It now
took about one quick glance out the window to generate the distribution
(less than a second). This was more than adequate for my needs under any
conceivable situation. But still, self-respecting R programmers are
supposed to eschew looping, and here was this ugly(?) loop within a
loop!  So let us persevere.

Approach 3: So the question is -- how to remove that inner loop? Again,
I invite novice R programmers to think about that before continuing
on...

The key here is that the inner loop is doing a very simple calculation:
just accumulating a sum.  Forsooth! -- the answer fairly leaps out: R
already has a cumsum() function that does this (looping in the
underlying C code), so one only has to figure out how to make use of it.

The essential idea to do this is to get a small sample from the large
mydat sample that is guaranteed to be bigger than one needs to total up
to more than K. Again, one can be profligate: if I need about 10 mydat
values on average to sum past K, then sampling 30 or 40 or some such
thing is sure to be enough (and I can always add a check or use
"try()" to make sure, if I am faint of heart). If smallsamp is this
sample of the next 30 or 40 values from my large bootstrap sample, then
the following code vectorizes the inner loops:

count <- sum(cumsum(smallsamp) < K) + 1

Note a trick here: the < K comparison produces a vector of logical
TRUEs for all cumsum values less than K. These are silently treated as
numeric 1's when added by sum(). A moment's thought will make clear why
1 must be added at the end.
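A tiny worked example of the trick, with made-up numbers:

```r
x <- c(3, 5, 2, 4, 6)   # pretend these are the next few bootstrap draws
K <- 9                  # target to exceed
cumsum(x)               # 3  8 10 14 20
cumsum(x) < K           # TRUE TRUE FALSE FALSE FALSE
count <- sum(cumsum(x) < K) + 1
count                   # 3: the third draw pushes the running sum past K
```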

Sure enough, this vectorization reduced the time about another twofold
-- to an eyeblink -- for my modest nsim sizes.
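Putting the pieces together, a sketch of Approach 3 (again with my own
illustrative mydat, K, nsim, and chunk size M):

```r
# Approach 3 (sketch): the inner loop replaced by cumsum().
set.seed(1)
mydat <- runif(100)                   # example data, values >= 0
K <- 20
nsim <- 1000

M <- ceiling(4 * K / mean(mydat))     # profligate chunk size, ~4x the average need
bigsamp <- sample(mydat, M * nsim, replace = TRUE)

result <- numeric(nsim)
for (i in 1:nsim) {
  smallsamp <- bigsamp[((i - 1) * M + 1):(i * M)]   # next M values
  count <- sum(cumsum(smallsamp) < K) + 1           # vectorized inner "loop"
  if (count > M) stop("chunk exhausted before exceeding K; increase M")
  result[i] <- count
}
```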

Approach 4: The last stage, of course, is to get rid of the outer loop,
which fills the vector for the distribution one element at a time, again
another R no-no. One way to do this -- the ONLY way I could think of (so
a challenge to smarter R programmers) -- is to make use of apply(),
which can seemingly always remove loops. Unfortunately, this is an
illusion, because what the apply() functions actually do is hide the
loops in clever R code, not remove them. Their virtue, as will be seen
here, is making the code more transparent by transferring the
bookkeeping of indexing to the R functions rather than having to
explicitly handle it yourself.

To use apply, the simple idea is just to make up a matrix with nsim
columns with enough rows so that each column contains "enough" random
draws from mydat to sum to more than K. Let's say that we decide M rows
will do it. Then the code for the whole simulation is:

   drawmat <- matrix(sample(mydat, M*nsim, rep=TRUE), ncol=nsim)
   drawmat <- apply(drawmat, 2, cumsum) < K
   result <- colSums(drawmat) + 1

The first statement generates the matrix of bootstrap samples, while the
second and third use the same trick as previously to get the number of
rows needed for each column to total more than K. Note that I used the
handy (and fast) internal colSums function to add up the 1's in the
columns. This code is much shorter, and I would say more transparent,
than the code produced by the previous explicit looping strategies.
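A self-contained version of those three statements, with M chosen as in
the earlier sketch and a faint-of-heart check added (mydat, K, and nsim
are again illustrative):

```r
# Approach 4 (runnable sketch): one matrix of bootstrap draws, no explicit loops.
set.seed(1)
mydat <- runif(100)
K <- 20
nsim <- 1000
M <- ceiling(4 * K / mean(mydat))     # rows per column, with plenty of slack

drawmat <- matrix(sample(mydat, M * nsim, replace = TRUE), ncol = nsim)
below <- apply(drawmat, 2, cumsum) < K            # M x nsim logical matrix
if (any(colSums(below) == M)) stop("a column never exceeded K; increase M")
result <- colSums(below) + 1                      # draw-counts, one per column
```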

However, not surprisingly, it does not run any faster (nor slower) than
the Approach 3 code. The looping is still there, hidden in the apply().

I hope that this long-winded example will help R newcomers in their
entry into what may seem like a daunting programming language. As others
have, I would urge all to consult R's quite remarkable documentation and
Help facility (BEFORE they post to this list), as well as Venables and
Ripley's two indispensable books on S language programming and data
analysis.

Cheers,
Bert Gunter

Non-Clinical Biostatistics
Genentech

"The business of the statistician is to catalyze the scientific learning
process."

 -- George E.P. Box



