[R] Summary: A speed improvement challenge

Sun Oct 28 13:58:54 CET 2001

Many thanks to Nick Ellis, Vadim Kutsyy, Charles Berry, and
Bill Dunlap for providing thoughts and alternative solutions to the
problem.
I have included Nick and Charles' notes in full below and a summary of
the others.  Thanks also to Bill for telling me about an
inconsistency in how the first argument of sample is interpreted.
I had ignored this, resulting in a major bug.

Each person suggested substituting a built-in function
such as sapply, apply, or outer for the explicit for() looping.
I think that outer causes memory problems for large
n and m, and that sapply and apply 
do not reduce execution time (but this is platform and
language version-specific).  On S-Plus 6.0 under Intel
Linux, Vadim's sapply() suggestion results in more than
doubling of the execution time over using a simple for().

Bill Dunlap provided functions for zeroing in on the subscripts
that need to be addressed.

My conclusion: For this problem, for() isn't bad.  The
code is simple and maintainable and fairly efficient
depending on the platform.  To get a major speed
improvement I would write a simple and small Fortran or C function
to be called from S.

One area in which I hope to improve the original code (which
is at the end of this note) is to also make the sampling
weights for selecting x depending on the distance of y
from the target y, a la kernel density estimators or
loess.

Thanks again for the thoughtful responses,

Frank Harrell

---------------------------------------------------------------------
Frank,

to restate your problem:

1.) do the linear interpolation as a first cut.
2.) find the subset of points that have more than one neighbour - this
is
kernel density estimation with a boxcar window of half-width del
3.) perform multinomial sampling on the x values for that subset - ie
from
(abc)(defg)(hij) select, say, afj, with probabilities (pqr)(stuv)(wxy)

Step 2 could be done with a call to outer(), but that would probably be
wasteful, especially 'cos it doesn't exploit the ordering of the values.
A C
program that runs through the sorted values would be quick.

As for 3, sample() does multinomial sampling for a single set of p
values
(as you stated), but it's not hard to implement sample() using runif,
and it
should be straightforward to vectorize. Something like

p <- list(c(1,2,3)/6,c(3,1)/4,c(1,1,1,1)/4)
n <- max(sapply(p,length))
pcum <- sapply(p,function(x,n)c(cumsum(x)/sum(x),rep(1,n))[1:n],n=n) #
make
pcum's all same length
apply(sweep(pcum,2,runif(2),">"),2,function(x,n)(1:n)[x][1],n=n)

Nick Ellis
CSIRO Marine Research	mailto:Nick.Ellis at csiro.au

-------------------------------------------------------------------------------
try:

yinv<-sapply(1:m,function(i,x,y,freq,aty) ifelse(any(s<-abs(y-aty[i]) <
del),
sample(x[s], 1, replace=F, prob=freq[s]/sum(freq[s])),
approx(y, x, xout=aty[i], rule=2)$y),
x,y,freq,aty)

===================================================================
Vadim Kutsyy, PhD                             http://www.kutsyy.com
vkutsyy at cytokinetics.com                           vadim at kutsyy.com
Statistician                                      

----------------------------------------------------------------------------
Off the top of my head:

First, move the test 

        any( (s <- abs(y-aty[i])) < del )

outside the loop. I think 

        OK.2.sample <- ( abs( min(y)-aty ) < del ) | ( abs( max(y)-aty )
< del )
[I think OK.2.sample needs to take into account more than min and max
-FH]

does this correctly in a vectorized manner.

Then you can attack the sample() and approx() separately.

Second, 

        s.mat <- abs( outer(y,aty[OK.2.sample],"-") ) < del

vectorizes calculation of s.

Then,

        new.freq <- array(0,nr=nrow(s.mat), nc=ncol(s.mat) )
        new.freq[s.mat] <- freq[col(s.mat)[s.mat]]
        new.freq <- new.freq/ c( new.freq %*% rep( 1, ncol(s.mat) ))

gives a matrix whose rows are the probability eights padded with zeroes
and 
        cum.freq <- apply(new.freq,1,cumsum)

is the cumulative weight.

        leq.weight <- rep(runif(length(y)), ncol(s.mat) ) <= cum.freq
        which.2.sample <- leq.weight %*% rep(1, ncol(s.mat) )

I ***think*** which.2.sample indexes x as you require.

If this is still too slow, I think the part from s.mat onward can be
rendered in C with little trouble. In fact, it might be easier to
understand in C...

-Charles Berry

-----------------------------------------------------------------------------

> -----Original Message-----
> From: Frank E Harrell Jr [mailto:fharrell at virginia.edu]
> Sent: Friday, October 26, 2001 3:01 AM
> To: s-news; rhelp
> Subject: [S] A speed improvement challenge
> 
> 
> I need to create a vector of probability-weighted randomly
> sampled values which each satisfy a different criterion.
> This causes the sampling weights to vary between elements
> of the vector.  Here is some example code, with x, y,
> and freq being vectors each of length n.  aty is a
> vector of length m.
> 
>   yinv <- double(m)
>   for(i in 1:m) {
>     s <- abs(y-aty[i]) < del
>     yinv[i] <- if(any(s)) 
>      sample(x[s], 1, replace=F, prob=freq[s]/sum(freq[s])) else
>      approx(y, x, xout=aty[i], rule=2)$y
>   }
> 
> Big picture: For a tabulated function given by (x,y) and
> frequency of each occurrence of (x,y) given by the corresponding
> element in freq, find the inverse of the function (x as a function
> of y) for a vector of desired y-values aty[1],...aty[m].  If
> the function has a nearly flat segment, let the resulting x[i] value
> be a weighted randomly sampled x such that f(x) is within del of
> the target y-value aty.  If no tabulated y is within del of the
> target y, use reverse linear interpolation to get y inverse.
> The reverse linear interpolation can easily be vectorized
> (approx(y, x, xout=aty, rule=2)$y).
> 
> Thanks for any ideas.
> -- 
> Frank E Harrell Jr              Prof. of Biostatistics & Statistics
> Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
> U. Virginia School of Medicine  
> http://hesweb1.med.virginia.edu/biostat
> ---------------------------------------------------------------------
> This message was distributed by s-news at lists.biostat.wustl.edu.  To
> unsubscribe send e-mail to s-news-request at lists.biostat.wustl.edu with
> the BODY of the message:  unsubscribe s-news
>
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._