[R] Calculate Closest 5 Cases?

Jim_Garrett@bd.com Jim_Garrett at bd.com
Mon Feb 16 20:22:56 CET 2004


Here's another approach that doesn't use sorting.  Instead, after
calculating distances it considers a threshold on distance and counts how
many cases are within the threshold.  Then a search over thresholds is
conducted to find a threshold yielding the desired number of cases.  Then
case numbers satisfying the found threshold are returned.

Why go to this bother to avoid sorting?  Since the computation required to
sort N distances grows more quickly than the computation required to make a
pass through a vector of distances and count values less than a threshold,
for some N this approach ought to be more efficient than one based on
sorting.  Well, maybe, if the number of iterations grows slowly enough with
N.  Call it a conjecture.  On the other hand, the N for which this occurs
depends on the efficiency of the sort algorithm (pretty darned efficient
and using compiled code, I imagine) and the efficiency of the search
algorithm.  Here I use uniroot, which isn't really designed with step
functions in mind, so perhaps there are possible improvements.  So I
couldn't say whether this will work faster than the other suggestions
you've gotten with your data, or even for data one is likely to ever see.

If there are ties for the 6th-nearest case (counting the point itself),
this routine should either include or exclude all the ties, whichever comes
closest to yielding 5 points.

Here's the code:

## Args:
## CaseNo:     Row number of case for which to find nearest neighbors.
## x:          A numeric matrix.
## k:          The desired number of neighbors, not counting the point of
## TempIndex:  Index of row numbers.  If you use this function many many
times you
##             might gain an iota of efficiency by creating this vector
once and
##             passing it in as an argument with each call.  Probably not
worth the
##             trouble, I couldn't help myself....
## verbose:    Tells you how many iterations uniroot used, if you're
## Value:
## A vector of (hopefully) k row numbers indicating the k nearest
nearestKNeighbors <- function(CaseNo, x, k, TempIndex = 1:nrow(x), verbose
= F) {
  ## make sure x is a matrix
    stop("x must be a matrix.")

  TempVect <- x[CaseNo, ]
  SquaredDistances <- apply(x, 1, function(s) sum((s - TempVect)^2) )

  tempFun <- function(h) sum(SquaredDistances < h^2) - (k + 1)
  TempSolution <- uniroot(tempFun, interval = range(SquaredDistances))
    cat("Required", TempSolution$iter, "iterations.\n")

  sort(setdiff(TempIndex[SquaredDistances < TempSolution$root^2], CaseNo))

You would apply this to each row of your dataset using some variety of
"apply" or a loop.  I don't know if memory usage would differ.  In the
event of ties, you may not have 5 neighbors, so I might put the results in
a list, which can accommodate different lengths in each component (indeed,
it can accommodate completely different data structures):

## data in matrix temp.matrix
ListOutput <-
         function(s) nearestKNeighbors(s, temp.matrix, 5))


ListOutput <-
  vector(nrow(temp.matrix), mode = "list")
for(i in 1:nrow(temp.matrix))
  ListOutput[[i]] <- nearestKNeighbors(i, temp.matrix, 5)

and then you can manipulate the list however you like.  For instance, to
see if all components have length 5,

all(unlist(lapply(ListOutput, length)) == 5)

or, if there are such components, find out which one(s):

(1:length(ListOutput))[unlist(lapply(ListOutput, length)) != 5]

Welcome to R!


-Jim Garrett


> I've only begun investigating R as a substitute for SPSS.

> I have a need to identify for each CASE the closest (or most similar) 5
> other CASES (not including itself as it is automatically the closest).  I

> have a fairly large matrix (50000 cases by 50 vars).  In SPSS, I can use
> Correlate > Distances to generate a matrix of similarity, but only on a
> sample.  The entire matrix can not be processed at once due to memory

> The data are all percents, so they are easy comparable.

> Is there any way to do this in R?

> Below is a small sample of the data (from SPSS) and the desired output.

> Thanks,

> Danny

This message is intended only for the designated recipient(s...{{dropped}}

More information about the R-help mailing list