[R] Summary: A speed improvement challenge
fharrell at virginia.edu
Sun Oct 28 13:58:54 CET 2001
Many thanks to Nick Ellis, Vadim Kutsyy, Charles Berry, and
Bill Dunlap for providing thoughts and alternative solutions to the
problem.
I have included Nick's and Charles's notes in full below and a summary of
the others. Thanks also to Bill for pointing out an
inconsistency in how the first argument of sample() is interpreted;
I had overlooked this, and it resulted in a major bug.
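The trap, for anyone who has not hit it: when the first argument of sample()
is a single positive number n, it is taken to mean 1:n rather than the
one-element vector containing n, so sample(x[s], 1, ...) misbehaves whenever
exactly one tabulated point falls within del of the target. For example (the
sample1 wrapper is only an illustration, not part of the original code):

sample(c(2, 7), 1)     # draws 2 or 7, as intended
sample(7, 1)           # draws from 1:7, NOT the single value 7
# defensive wrapper: treat a length-one x as the value itself
sample1 <- function(x, prob)
  if(length(x) == 1) x else sample(x, 1, replace = FALSE, prob = prob)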
Each person suggested substituting a built-in function
such as sapply, apply, or outer for the explicit for() loop.
I think that outer causes memory problems for large
n and m, and that sapply and apply
do not reduce execution time (but this is platform- and
language-version-specific). On S-Plus 6.0 under Intel
Linux, Vadim's sapply() suggestion more than doubles
the execution time relative to a simple for().
Bill Dunlap provided functions for zeroing in on the subscripts
that need to be addressed.
My conclusion: for this problem, for() isn't bad. The
code is simple and maintainable, and fairly efficient
depending on the platform. To get a major speed
improvement I would write a simple and small Fortran or C function
to be called from S.
One area in which I hope to improve the original code (included
at the end of this note) is to make the sampling
weights for selecting x also depend on the distance of the tabulated y
from the target y, a la kernel density estimators or
loess.
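A sketch of one way to do that, using a tricube weight in the distance as
loess does (the kernel choice and the length-one guard are illustrative
assumptions, not part of the original code):

yinv <- double(m)
for(i in 1:m) {
  d <- abs(y - aty[i])
  s <- d < del
  if(any(s)) {
    w <- freq[s] * (1 - (d[s]/del)^3)^3      # tricube kernel times the frequency
    yinv[i] <- if(sum(s) == 1) x[s] else
      sample(x[s], 1, replace = FALSE, prob = w/sum(w))
  } else yinv[i] <- approx(y, x, xout = aty[i], rule = 2)$y
}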
Thanks again for the thoughtful responses,
Frank Harrell
---------------------------------------------------------------------
Frank,
to restate your problem:
1.) do the linear interpolation as a first cut.
2.) find the subset of points that have more than one neighbour - this is
    kernel density estimation with a boxcar window of half-width del
3.) perform multinomial sampling on the x values for that subset - i.e. from
    (abc)(defg)(hij) select, say, afj, with probabilities (pqr)(stuv)(wxy)
Step 2 could be done with a call to outer(), but that would probably be
wasteful, especially 'cos it doesn't exploit the ordering of the values.
A C program that runs through the sorted values would be quick.
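For comparison, within R itself findInterval() on the sorted y values gives
these neighbour counts without forming an n-by-m matrix (a sketch;
findInterval is an R function, and its boundary handling differs slightly
from the strict < del test):

ys   <- sort(y)
cnt  <- findInterval(aty + del, ys) - findInterval(aty - del, ys)  # tabulated y near each target
flat <- cnt > 1    # Nick's step 2: targets with more than one neighbour
need <- cnt > 0    # the any(s) condition in the original loop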
As for 3, sample() does multinomial sampling for a single set of p values
(as you stated), but it's not hard to implement sample() using runif, and it
should be straightforward to vectorize. Something like

p <- list(c(1,2,3)/6, c(3,1)/4, c(1,1,1,1)/4)
n <- max(sapply(p, length))
# pad each cumulative probability vector with 1's so the pcum columns all have length n
pcum <- sapply(p, function(x, n) c(cumsum(x)/sum(x), rep(1, n))[1:n], n = n)
# one uniform per set; the sampled index is the first cumulative value that exceeds it
apply(sweep(pcum, 2, runif(length(p)), ">"), 2,
      function(x, n) (1:n)[x][1], n = n)
Nick Ellis
CSIRO Marine Research    Nick.Ellis at csiro.au
-------------------------------------------------------------------------------
try:
yinv <- sapply(1:m, function(i, x, y, freq, aty)
    ifelse(any(s <- abs(y - aty[i]) < del),
           sample(x[s], 1, replace = F, prob = freq[s]/sum(freq[s])),
           approx(y, x, xout = aty[i], rule = 2)$y),
    x, y, freq, aty)
===================================================================
Vadim Kutsyy, PhD http://www.kutsyy.com
vkutsyy at cytokinetics.com vadim at kutsyy.com
Statistician
----------------------------------------------------------------------------
Off the top of my head:
First, move the test
any( (s <- abs(y-aty[i])) < del )
outside the loop. I think
OK.2.sample <- ( abs(min(y) - aty) < del ) | ( abs(max(y) - aty) < del )
[I think OK.2.sample needs to take into account more than min and max -FH]
does this correctly in a vectorized manner.
Then you can attack the sample() and approx() separately.
Second,
s.mat <- abs( outer(y,aty[OK.2.sample],"-") ) < del
vectorizes calculation of s.
Then,
new.freq <- array(0, dim = dim(s.mat))
new.freq[s.mat] <- freq[col(s.mat)[s.mat]]
new.freq <- new.freq/ c( new.freq %*% rep( 1, ncol(s.mat) ))
gives a matrix whose rows are the probability weights padded with zeroes, and
cum.freq <- apply(new.freq,1,cumsum)
is the cumulative weight.
leq.weight <- rep(runif(length(y)), ncol(s.mat) ) <= cum.freq
which.2.sample <- leq.weight %*% rep(1, ncol(s.mat) )
I ***think*** which.2.sample indexes x as you require.
If this is still too slow, I think the part from s.mat onward can be
rendered in C with little trouble. In fact, it might be easier to
understand in C...
-Charles Berry
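For concreteness, here is a minimal self-contained version of this
cumulative-weight idea with the indexing written out, taking rows of the
weight matrix to index the targets (a sketch against the same x, y, freq,
aty, and del as in the original code, not necessarily what Charles had in
mind):

s.mat <- abs(outer(aty, y, "-")) < del            # rows: targets, columns: tabulated points
ok    <- rowSums(s.mat) > 0                       # targets with a tabulated y within del
w     <- sweep(s.mat[ok, , drop = FALSE], 2, freq, "*")
w     <- w / rowSums(w)                           # each row sums to 1: the sampling weights
cumw  <- t(apply(w, 1, cumsum))                   # row-wise cumulative weights
u     <- runif(sum(ok))                           # one uniform draw per qualifying target
pick  <- pmin(rowSums(cumw < u) + 1, length(y))   # first column reaching u (pmin guards rounding)
yinv     <- approx(y, x, xout = aty, rule = 2)$y  # interpolation everywhere as a first cut
yinv[ok] <- x[pick]                               # overwrite with weighted samples where possible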
-----------------------------------------------------------------------------
> -----Original Message-----
> From: Frank E Harrell Jr [mailto:fharrell at virginia.edu]
> Sent: Friday, October 26, 2001 3:01 AM
> To: s-news; rhelp
> Subject: [S] A speed improvement challenge
>
>
> I need to create a vector of probability-weighted randomly
> sampled values which each satisfy a different criterion.
> This causes the sampling weights to vary between elements
> of the vector. Here is some example code, with x, y,
> and freq being vectors each of length n. aty is a
> vector of length m.
>
> yinv <- double(m)
> for(i in 1:m) {
>   s <- abs(y-aty[i]) < del
>   yinv[i] <- if(any(s))
>     sample(x[s], 1, replace=F, prob=freq[s]/sum(freq[s])) else
>     approx(y, x, xout=aty[i], rule=2)$y
> }
>
> Big picture: For a tabulated function given by (x,y) and
> frequency of each occurrence of (x,y) given by the corresponding
> element in freq, find the inverse of the function (x as a function
> of y) for a vector of desired y-values aty[1],...aty[m]. If
> the function has a nearly flat segment, let the resulting x[i] value
> be a weighted randomly sampled x such that f(x) is within del of
> the target y-value aty. If no tabulated y is within del of the
> target y, use reverse linear interpolation to get y inverse.
> The reverse linear interpolation can easily be vectorized
> (approx(y, x, xout=aty, rule=2)$y).
>
> Thanks for any ideas.
> --
> Frank E Harrell Jr Prof. of Biostatistics & Statistics
> Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
> U. Virginia School of Medicine
> http://hesweb1.med.virginia.edu/biostat