[R] Summary: Vectorizing closest match

Frank E Harrell Jr fharrell at virginia.edu
Thu Mar 28 23:54:40 CET 2002


The original problem I posed was

Let

 x = real vector of length n
 y = real vector of length n
 w = real vector of length m, m typically less than n/2 but can be > n
 z = real vector of length m

 For w[i], i=1,...,m, find the value of x that is closest to w[i].  In the
 case of ties, select one (optimally at random or just take the first
 match).  Let z[i] = value of y corresponding to the closest x.
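
As a baseline for checking the faster solutions, the problem statement above can be sketched as a direct scan of x for each w[i].  This is only an illustration: closestBrute and the sample vectors are made-up names and data, not from the thread, and the cost grows as n*m.

```r
## Brute-force reference implementation (hypothetical helper name).
## For each w[i], return the subscript of the closest value in x.
closestBrute <- function(x, w) {
  ## which.min returns the first minimum, so ties go to the first match
  vapply(w, function(wi) which.min(abs(x - wi)), integer(1))
}

x <- c(1, 4, 9, 16)
y <- c(10, 20, 30, 40)
w <- c(0, 5, 12.5, 100)   # 12.5 is equally close to 9 and 16

idx <- closestBrute(x, w)  # subscripts into x
z   <- y[idx]              # y values paired with the closest x
```

Here the tie at 12.5 resolves to the first match (x = 9), which is one of the two tie rules allowed in the problem statement.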

I received several helpful replies.  Peter Dalgaard suggested the use of approx.  approx always amazes me.  To return the index of the closest match you can use

round(approx(x,1:length(x),xout=w,rule=2,ties='ordered')$y)

For S-Plus just remove ties=.
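
A small worked example of the approx idea, assuming x is already sorted in increasing order (an unsorted x would first need order(), as in the cut version).  Linear interpolation of the subscripts 1..n passes k+0.5 exactly halfway between x[k] and x[k+1], so rounding picks the nearer neighbour, and rule=2 clamps values of w outside the range of x to the end points.  The data are invented for illustration.

```r
x <- c(1, 4, 9, 16)       # reference values, sorted increasing (assumption)
y <- c(10, 20, 30, 40)
w <- c(0, 5, 11, 100)     # 0 and 100 lie outside range(x)

## Interpolate the subscript, then round to the nearest whole subscript
idx <- round(approx(x, seq_along(x), xout = w,
                    rule = 2, ties = "ordered")$y)
z <- y[idx]
```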

David James suggested using cut.  I adapted his code and sped it up, as he suggested, by calling the .C routine used by cut.default directly (set the global variable .R. to TRUE for R, FALSE for S-Plus):

whichClosest <- function(x, w) {
  ## x: vector of reference values
  ## w: vector of values to find closest matches in x
  ## Returns: subscripts in x corresponding to w
  i <- order(x)
  x <- x[i]
  n <- length(x)
  br <- c(-1e30, x[-n]+diff(x)/2,1e30)
  m <- length(w)
  if(.R.) i[.C("bincode", as.double(w), m, as.double(br),
               length(br), code = integer(m), right = TRUE, 
               include = FALSE, NAOK = TRUE, DUP = FALSE, 
               PACKAGE = "base")$code] else
  i[.C("S_binning3", x=as.double(w), m, as.double(br),
       length(br), 0, 0, TRUE, TRUE)$x]  # For S-Plus
}
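
For readers on a later R, the same midpoint-break idea can be sketched with .bincode, the binning helper that base R exposes for cut.default.  This assumes an R version that provides .bincode; whichClosestBincode is a hypothetical name, and -Inf/Inf stand in for the -1e30/1e30 sentinels above.

```r
## Sketch: midpoint breaks plus .bincode (assumes .bincode is available)
whichClosestBincode <- function(x, w) {
  i <- order(x)                 # remember original positions
  x <- x[i]
  n <- length(x)
  ## breaks halfway between neighbouring sorted x values
  br <- c(-Inf, x[-n] + diff(x) / 2, Inf)
  ## right = TRUE: a w exactly on a break goes to the lower x
  i[.bincode(w, br, right = TRUE, include.lowest = FALSE)]
}

x <- c(9, 1, 16, 4)             # deliberately unsorted
w <- c(0, 5, 11, 100)
res <- whichClosestBincode(x, w)
x[res]                          # closest x for each w
```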

Note that for large n, cut.default is extremely slow.

Thomas Lumley had a nice approach using a new function findInterval.  All three approaches are extremely fast.  The main difference between approx and the cut approach is how ties are resolved, i.e. which of two equally close x values is chosen.  I plan to use approx.  To randomize choices in case of ties one can jitter the x vector.
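
A minimal sketch of how findInterval could be used for this.  It is not necessarily Thomas Lumley's code, which is not reproduced here; whichClosestFI is a hypothetical name.  findInterval counts how many midpoints are <= w, so adding 1 gives the position of the closest sorted x.

```r
## Sketch: closest match via findInterval on the midpoints
whichClosestFI <- function(x, w) {
  i  <- order(x)                  # remember original positions
  xs <- x[i]
  n  <- length(xs)
  mid <- xs[-n] + diff(xs) / 2    # midpoints between neighbouring sorted x
  ## number of midpoints <= w, plus 1, is the position in xs
  i[findInterval(w, mid) + 1]
}

x <- c(9, 1, 16, 4)               # deliberately unsorted
w <- c(0, 5, 11, 100, 2.5)        # 2.5 is equally close to 1 and 4
res <- whichClosestFI(x, w)
x[res]                            # closest x for each w
```

With the default left-closed intervals a w that falls exactly on a midpoint (2.5 here) is assigned to the upper neighbour, whereas the cut version with right = TRUE assigns it to the lower one.  If x is jittered to randomize ties, the amount should be small relative to the spacing of x, since jittering moves every midpoint slightly.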

Thanks again all,

Frank
-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat