[R] Calculate Closest 5 Cases?

Fri Feb 13 21:37:18 CET 2004

<dsheuman at rogers.com> writes:

> I've only begun investigating R as a substitute for SPSS.
> 
> I have a need to identify for each CASE the closest (or most similar) 5 
> other CASES (not including itself as it is automatically the closest).  I 
> have a fairly large matrix (50000 cases by 50 vars).  In SPSS, I can use
> Correlate > Distances to generate a matrix of similarity, but only on a
> small sample.  The entire matrix can not be processed at once due to
> memory limitations.
> 
> The data are all percents, so they are easily comparable.
> 
> Is there any way to do this in R?
> 
> Below is a small sample of the data (from SPSS) and the desired output.
> 
> Thanks,
> 
> Danny

This seems to be close:

d <- read.table("tempfile") # needed to edit to get 12 items per line.
close6 <- function(r)
  d$V1[order(apply(d[-1],1,
                   function(r2)dist(rbind(r,r2))))][1:6]
t(apply(d[-1],1,close6))

       [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
1  10170069 11010422 11460001 11070078 12660644 11790016
2  10190229 11780034 11460001 10170069 11650133 11070078
3  10540023 12660644 10662074 11060762 12661667 11070078
4  10650413 11180646 11780034 11790016 10662074 11460001
5  10662074 11060762 10650413 12660338 11180646 10540023
6  10770041 11790016 11650275 11010422 11460001 11180646
7  11010422 10170069 11650275 11460001 11060762 11790016
8  11060762 10662074 12660338 12661667 11010422 11460001
9  11070078 11460001 12660644 10170069 11780034 12660338
10 11180646 10650413 11780034 11790016 11460001 10662074
11 11460001 11790016 11070078 11780034 10650413 10170069
12 11650133 11780034 12660644 11060762 10650413 11460001
13 11650275 11010422 11460001 11790016 11180646 10770041
14 11780034 11650133 11180646 10650413 11790016 11460001
15 11790016 11460001 11180646 11780034 10650413 10770041
16 12660338 11060762 10662074 11650275 11070078 10650413
17 12660644 10540023 11650133 11780034 11070078 11060762
18 12661667 11060762 10662074 10540023 11010422 12660644

Notice that I use a function to get the closest *6* IDs because the
method will include the row itself. If multiple rows have distance
zero, this might be a problem, since you're not guaranteed to get the
ID of the "self" row sorted first.
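
To illustrate the tie problem, here is a small sketch on made-up data
(the data frame "toy" and the helper "close2" are hypothetical, not
taken from Danny's sample). Rows 1 and 3 are identical, and since
order() leaves ties in their original order, row 3 gets row 1's ID in
the first column rather than its own:

```r
# Made-up data: rows 1 and 3 have identical values.
toy <- data.frame(V1 = c(101, 102, 103),
                  V2 = c(0.1, 0.9, 0.1),
                  V3 = c(0.2, 0.8, 0.2))

# Same idea as close6 above, but returning only the 2 closest IDs.
close2 <- function(r)
  toy$V1[order(apply(toy[-1], 1,
                     function(r2) dist(rbind(r, r2))))][1:2]

res <- t(apply(toy[-1], 1, close2))
res   # first column is 101 for both row 1 and row 3
```

So simply dropping the first column would remove the wrong ID for
row 3.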

Here's another try:

close5 <- function(i)
  d$V1[-i][order(apply(d[-i,-1],1,function(r)dist(rbind(d[i,-1],r))))[1:5]]

do.call("rbind",lapply(1:nrow(d),close5))

However, for some reason this is much slower. Getting rid of the more
obvious inefficiencies (some of which would really kill you on a large
data set since they involve copying the entire data frame!) doesn't
really help:

dd <- d[-1]
close5 <- function(i) {
  r1 <- dd[i,]
  d$V1[-i][order(apply(dd,1,function(r)dist(rbind(r1,r)))[-i])[1:5]]
}

> system.time(do.call("rbind",lapply(1:nrow(d),close5)))
[1] 1.67 0.00 1.67 0.00 0.00

whereas

> system.time(t(apply(d[-1],1,close6)))
[1] 0.23 0.00 0.23 0.00 0.00

Anyone have a better idea, or just an explanation of the slowness? 
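
One possible direction, offered as a sketch rather than a tested
answer: a likely contributor to the slowness (not profiled here) is
that dd[i,] is a one-row data frame, so rbind(r1,r) goes through the
data frame method on every call. The version below works on a plain
numeric matrix, computes all distances from one case in a single
vectorized step, and excludes the case itself by index, which also
sidesteps the tie issue. The function name "nearest_k", its arguments,
and the random data are illustrative assumptions, not part of the
original problem.

```r
# Sketch: IDs of the k nearest rows by Euclidean distance, excluding
# the row itself. 'ids' has one entry per row of the numeric matrix 'm'.
nearest_k <- function(ids, m, k = 5) {
  stopifnot(is.matrix(m), length(ids) == nrow(m), k < nrow(m))
  tm <- t(m)                          # variables in rows, cases in columns
  out <- matrix(ids[1], nrow(m), k)   # preallocate with the type of 'ids'
  for (i in seq_len(nrow(m))) {
    d2 <- colSums((tm - m[i, ])^2)    # squared distances from case i to all cases
    d2[i] <- Inf                      # drop the case itself by index, not by rank
    out[i, ] <- ids[order(d2)[1:k]]
  }
  out
}

# Made-up data for illustration only.
set.seed(1)
m <- matrix(runif(20 * 4), nrow = 20)
ids <- 1001:1020
nn <- nearest_k(ids, m)
```

With the data frame from this thread the call would be
nearest_k(d$V1, as.matrix(d[-1])). Only one vector of distances is
held at a time, so the full distance matrix is never formed, though
the work still grows with the square of the number of cases.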

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907