[R] finding euclidean proximate points in two datasets

Thu May 20 18:18:45 CEST 2010

On May 20, 2010, at 11:12 AM, Alexander Shenkin wrote:

> On 5/20/2010 9:18 AM, David Winsemius wrote:
>>
>> On May 20, 2010, at 10:02 AM, Alexander Shenkin wrote:
>>
>>> Hello all,
>>>
>>> I've been pouring through the various spatial packages, but  
>>> haven't come
>>> across the right thing yet.
>>
>> There is a SIG for such questions.
>
> thanks - just joined it.
>
>>> Given a set of points in 2-d space X, i'm trying to find the  
>>> subset of
>>> points in Y proximate to each point in X.  Furthermore, the  
>>> proximity
>>> threshold of each point in X differs (X$threshold).  I've  
>>> constructed
>>> this myself already, but it's horrificly slow with a dataset of 40k+
>>> points in one set, and a 700 in the other.
>>>
>>> A very inefficient example of what I'm looking for:
>>
>> Not really a reproducible example. If euclidean_dist is a function ,
>> then it is not one in any of the packages I have installed.
>
> it's not reproducible - i'll make a better effort to include
> reproducible code in the future.  and that function is just one i  
> would
> have written, but didn't want to clog the email with code.  Anyway,  
> here
> is a reproducible example:
>
> X = data.frame(x=c(1,2,3), y=c(2,3,1), threshold=c(1,2,4))
> Y = data.frame(x=c(5,2,3,4,2,5,2,3), y=c(5,2,2,4,1,2,3,1))
> proximate=list()
> i=1
> for (pt in 1:length(X$x)) {
>    proximate[[i]] <- sqrt((X[pt,]$x - Y$x)^2 + (X[pt,]$y - Y$y)^2) >
> X[pt,]$threshold
>    i=i+1
> }
> proximate

This might be a more R-ish way of approaching the problem. It is not  
always the case that apply functions will be faster than looping, but  
they may be more compact and they get you ready for thinking about  
indexed solutions which would be generally faster.

 > apply(X, 1, function(.x) {  (.x[1] - Y$x )^2+(.x[2] - Y$y)^2  
<.x[3]^2 } )

       [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE  TRUE  TRUE
[4,] FALSE FALSE  TRUE
[5,] FALSE FALSE  TRUE
[6,] FALSE FALSE  TRUE
[7,] FALSE  TRUE  TRUE
[8,] FALSE FALSE  TRUE

Transposing lets you check that the results are the same as your  
solution modulo yours and mine being slightly different classes:

 > t( apply(X, 1, function(.x) {  (.x[1] - Y$x )^2+(.x[2] - Y$y)^2  
<.x[3]^2 } ) )
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
[3,] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

 > proximate
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[[2]]
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE

[[3]]
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

>
>>>
>>>   for (pt in X$idx) {
>>>       proximity[i] = euclidian_dist(X[pt]$x, X[pt]$y, Y$x, Y$y) <
>>> X$threshold
>>>    i = i+1
>>>   }
>>>
>>
>> Have you considered first creating a subset of candidate points  
>> that are
>> within "threshold" of each reference point on both coordinates. That
>> might sidestep a lot of calculations on points that are easily
>> eliminated on a single comparison. Then you could calculate distances
>> within that surviving subset of points. On average that should give  
>> you
>> an over 50% "hit rate":
>>
>>> (4/3)*pi*0.5^3
>> [1] 0.5235988
>
> That's a nice idea.  I'll still be waiting quite a while while my
> machine cranks, but not as long.  Still - I suspect there would be  
> much
> bigger gains if there were tailored packages.  I'll re-post over on
> sig-geo about that.  thanks.
>
>>> Perhaps crossdist() in spatstat is what I should use, and then  
>>> code a
>>> comparison with X$threshold after the cross-distances are computed.
>>> However, I was wondering if there was another tool I should be
>>> considering.  Any and all thoughts are very welcome.  Thanks in  
>>> advance.
>>>
>>> Thanks,
>>> Allie
>>> -- 
>>> Alexander Shenkin
>>> PhD Candidate
>>> School of Natural Resources and Environment
>>> University of Florida

David Winsemius, MD
West Hartford, CT