[R] Summary: A speed improvement challenge

fharrell at virginia.edu
Sun Oct 28 13:58:54 CET 2001

Many thanks to Nick Ellis, Vadim Kutsyy, Charles Berry, and
Bill Dunlap for providing thoughts and alternative solutions to the
problem I posted.  I have included Nick's and Charles' notes in full
below and a summary of the others.  Thanks also to Bill for telling me
about an inconsistency in how the first argument of sample() is
interpreted.  I had ignored this, resulting in a major bug.

Each person suggested substituting a built-in function
such as sapply, apply, or outer for the explicit for() looping.
I think that outer causes memory problems for large
n and m, and that sapply and apply 
do not reduce execution time (but this is platform and
language version-specific).  On S-Plus 6.0 under Intel
Linux, Vadim's sapply() suggestion results in more than
doubling of the execution time over using a simple for().
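For anyone wanting to check the timing claim on their own platform, here
is a self-contained R sketch of the original for() loop on toy data (the
sizes n, m and the value of del are arbitrary choices of mine; S-Plus
users would substitute unix.time() for system.time()):

```r
## Time the original for() loop on synthetic data.
set.seed(1)
n <- 10000; m <- 500; del <- 0.01
y <- sort(runif(n))                      # tabulated function values, sorted
x <- seq(0, 1, length.out = n)           # corresponding abscissae
freq <- rep(1, n)                        # equal frequencies for simplicity
aty <- runif(m)                          # target y-values

t.for <- system.time({
  yinv <- double(m)
  for(i in 1:m) {
    s <- abs(y - aty[i]) < del
    yinv[i] <- if(any(s))
      sample(x[s], 1, replace = FALSE, prob = freq[s] / sum(freq[s])) else
      approx(y, x, xout = aty[i], rule = 2)$y
  }
})
```

Swapping the loop body for one of the sapply() variants quoted below
gives a direct comparison on the same data.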

Bill Dunlap provided functions for zeroing in on the subscripts
that need to be addressed.

My conclusion: For this problem, for() isn't bad.  The
code is simple and maintainable and fairly efficient
depending on the platform.  To get a major speed
improvement I would write a simple and small Fortran or C function
to be called from S.

One area in which I hope to improve the original code (which
is at the end of this note) is to make the sampling
weights for selecting x also depend on the distance of y
from the target y, a la kernel density estimators.
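As a concrete illustration of that idea, here is a hedged R sketch; the
function name kernel.sample1 and the choice of a triangular kernel are
my own, not from the original code:

```r
## Downweight freq by the distance of y from the target, within
## half-width del, using a triangular kernel (an arbitrary choice).
kernel.sample1 <- function(x, y, freq, target, del) {
  d <- abs(y - target)
  s <- d < del
  if(!any(s))
    return(approx(y, x, xout = target, rule = 2)$y)  # no neighbours: interpolate
  xs <- x[s]
  if(length(xs) == 1)
    return(xs)   # guard against sample()'s scalar-first-argument quirk
  w <- freq[s] * (1 - d[s] / del)   # kernel-downweighted frequencies
  sample(xs, 1, replace = FALSE, prob = w / sum(w))
}
## e.g., with a flat segment around y = 10:
## kernel.sample1(x = 1:3, y = c(10, 10.01, 30), freq = rep(1, 3),
##                target = 10, del = 0.5)
```

The length-one guard also sidesteps the inconsistency in sample()'s
first argument mentioned above.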

Thanks again for the thoughtful responses,

Frank Harrell


To restate your problem:

1.) do the linear interpolation as a first cut.
2.) find the subset of points that have more than one neighbour - this
is like kernel density estimation with a boxcar window of half-width del.
3.) perform multinomial sampling on the x values for that subset - ie
from groups (abc)(defg)(hij) select, say, a f j, with probabilities
(pqr)(stuv)(wxy).

Step 2 could be done with a call to outer(), but that would probably be
wasteful, especially 'cos it doesn't exploit the ordering of the values.
A program that runs through the sorted values would be quick.
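In R, that sorted-values pass could be sketched with findInterval(),
which binary-searches the sorted y; the function name neighbour.count is
mine, and note that the half-open window (t-del, t+del] differs at the
boundary from a strict |y - t| < del test:

```r
## Count, for each target, how many sorted y values lie within del,
## without forming the n-by-m matrix that outer() would allocate.
neighbour.count <- function(ysorted, aty, del)
  findInterval(aty + del, ysorted) - findInterval(aty - del, ysorted)
```

Targets with a count of zero go to the interpolation branch; those with
a count above one need the multinomial sampling of step 3.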

As for 3, sample() does multinomial sampling for a single set of p
(as you stated), but it's not hard to implement sample() using runif,
and it
should be straightforward to vectorize. Something like

p <- list(c(1,2,3)/6, c(3,1)/4, c(1,1,1,1)/4)
n <- max(sapply(p, length))
pcum <- sapply(p, function(x,n) c(cumsum(x)/sum(x), rep(1,n))[1:n], n=n)
# pcum's all same length
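One hedged way to finish that sketch (the u and idx names are my
additions): draw one uniform per group and count, column-wise, how many
cumulative probabilities fall below it.

```r
## Vectorized multinomial draws, one per group, built on the pcum idea.
p <- list(c(1,2,3)/6, c(3,1)/4, c(1,1,1,1)/4)
n <- max(sapply(p, length))
pcum <- sapply(p, function(x, n) c(cumsum(x)/sum(x), rep(1, n))[1:n], n = n)
u <- runif(length(p))                     # one U(0,1) deviate per group
idx <- 1 + colSums(pcum < matrix(u, n, length(p), byrow = TRUE))
## idx[k] is the sampled position within the k-th group; the padding 1s
## are never counted because u < 1.
```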

Nick Ellis
CSIRO Marine Research	mailto:Nick.Ellis at csiro.au

yinv <- sapply(1:m, function(i, x, y, freq, aty)
  ifelse(any(s <- abs(y - aty[i]) < del),
         sample(x[s], 1, replace=F, prob=freq[s]/sum(freq[s])),
         approx(y, x, xout=aty[i], rule=2)$y),
  x=x, y=y, freq=freq, aty=aty)

Vadim Kutsyy, PhD                             http://www.kutsyy.com
vkutsyy at cytokinetics.com                           vadim at kutsyy.com

Off the top of my head:

First, move the test 

        any( (s <- abs(y-aty[i])) < del )

outside the loop. I think 

        OK.2.sample <- ( abs( min(y)-aty ) < del ) | ( abs( max(y)-aty ) < del )

does this correctly in a vectorized manner.
[I think OK.2.sample needs to take into account more than the min and
max of y.]

Then you can attack the sample() and approx() separately.


        s.mat <- abs( outer(y,aty[OK.2.sample],"-") ) < del

vectorizes calculation of s.


        new.freq <- matrix(0, nrow=nrow(s.mat), ncol=ncol(s.mat) )
        new.freq[s.mat] <- freq[col(s.mat)[s.mat]]
        new.freq <- new.freq/ c( new.freq %*% rep( 1, ncol(s.mat) ))

gives a matrix whose rows are the probability weights padded with zeroes.
        cum.freq <- apply(new.freq,1,cumsum)

is the cumulative weight.

        leq.weight <- rep(runif(length(y)), ncol(s.mat) ) <= cum.freq
        which.2.sample <- leq.weight %*% rep(1, ncol(s.mat) )

I ***think*** which.2.sample indexes x as you require.

If this is still too slow, I think the part from s.mat onward can be
rendered in C with little trouble. In fact, it might be easier to
understand in C...

-Charles Berry


> -----Original Message-----
> From: Frank E Harrell Jr [mailto:fharrell at virginia.edu]
> Sent: Friday, October 26, 2001 3:01 AM
> To: s-news; rhelp
> Subject: [S] A speed improvement challenge
> I need to create a vector of probability-weighted randomly
> sampled values which each satisfy a different criterion.
> This causes the sampling weights to vary between elements
> of the vector.  Here is some example code, with x, y,
> and freq being vectors each of length n.  aty is a
> vector of length m.
>   yinv <- double(m)
>   for(i in 1:m) {
>     s <- abs(y-aty[i]) < del
>     yinv[i] <- if(any(s)) 
>      sample(x[s], 1, replace=F, prob=freq[s]/sum(freq[s])) else
>      approx(y, x, xout=aty[i], rule=2)$y
>   }
> Big picture: For a tabulated function given by (x,y) and
> frequency of each occurrence of (x,y) given by the corresponding
> element in freq, find the inverse of the function (x as a function
> of y) for a vector of desired y-values aty[1],...aty[m].  If
> the function has a nearly flat segment, let the resulting x[i] value
> be a weighted randomly sampled x such that f(x) is within del of
> the target y-value aty.  If no tabulated y is within del of the
> target y, use reverse linear interpolation to get y inverse.
> The reverse linear interpolation can easily be vectorized
> (approx(y, x, xout=aty, rule=2)$y).
> Thanks for any ideas.
> -- 
> Frank E Harrell Jr              Prof. of Biostatistics & Statistics
> Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
> U. Virginia School of Medicine  
> http://hesweb1.med.virginia.edu/biostat