[R] Finding overlaps in vector

Johannes Graumann johannes_graumann at web.de
Fri Dec 21 20:59:29 CET 2007

```Jim,

Although I can't find the post this code stems from, I came across it while
prowling the newsgroup. It's not the one you shared with me to eliminate
overlaps (which I referenced below:
http://tolstoy.newcastle.edu.au/R/e2/help/07/07/21286.html). That
particular solution marked entries as overlapping or not, and I am looking
for an extension to that code which would also return the actual "clusters"
of consecutively overlapping values. While Gabor's code in this thread does
what I require for the example, I still hope somebody more clueful than
myself can extend your code, since it carries the - for me - significant
advantage of being able to build the windows of overlap with different
values for 'up' and 'down': for example, to check which values overlap when
the overlap-defining distance is 5 ppm 'up' and 7.5 ppm 'down' from each
value. This is a generalization I would highly cherish.

Thanks for your help and continuous patience on r-help.

Joh
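
# A possible sketch of the generalization asked for above. This is an
# illustration only, not code posted in this thread. It assumes each value v
# gets a window [v - down, v + up] and that values whose windows chain
# together (touching counts as overlapping) form one cluster. The function
# name 'overlap_clusters' and its arguments 'down', 'up' and 'min.size' are
# made up for this example; 'down' and 'up' may be single numbers or vectors
# of the same length as 'v'.
overlap_clusters <- function(v, down, up, min.size = 2) {
  n <- length(v)
  if (n == 0) return(list())
  lo <- v - down                 # lower edge of each window
  hi <- v + up                   # upper edge of each window
  ord <- order(lo)
  lo <- lo[ord]; hi <- hi[ord]; v <- v[ord]
  # furthest upper edge seen so far, scanning from the left
  reach <- cummax(hi)
  # a new cluster starts where a window begins beyond everything before it
  new.cluster <- c(TRUE, lo[-1] > reach[-n])
  groups <- split(v, cumsum(new.cluster))
  # keep only clusters holding at least 'min.size' values
  unname(groups[lengths(groups) >= min.size])
}

# Fixed distance, looking 'up' only, as in the example quoted below; this
# should give the same four groups as listed there:
#   vector <- c(0,0.45,1,2,3,3.25,3.33,3.75,4.1,5,6,6.45,7,7.1,8)
#   overlap_clusters(vector, down = 0, up = 0.5)
# Asymmetric ppm windows, assuming ppm is taken relative to each value:
#   overlap_clusters(vector, down = vector * 7.5e-6, up = vector * 5e-6)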

jim holtman wrote:

> Here is a modification of the algorithm to use a specified value for
> the overlap:
>
>> vector <- c(0,0.45,1,2,3,3.25,3.33,3.75,4.1,5,6,6.45,7,7.1,8)
>> # the following uses 0.5 as the overlap distance -- can be changed
>> x <- rbind(cbind(value=vector, oper=1, id=seq_along(vector)),
> +            cbind(value=vector+0.5, oper=-1, id=seq_along(vector)))
>> x <- x[order(x[,'value'], -x[, 'oper']),]
>> # determine which ones overlap
>> x <- cbind(x, over=cumsum(x[, 'oper']))
>> # now partition into groups: determine where the breaks are
>> # (rows where 'over' has dropped back to 0)
>> x <- cbind(x, breaks=cumsum(x[, 'over'] == 0))
>> # delete entries with 'over' == 0
>> x <- x[x[, 'over'] != 0,]
>> # split into groups
>> x.groups <- split(x[, 'id'], x[, 'breaks'])
>> # keep groups with at least 3 rows (i.e. two or more overlapping values)
>> x.subsets <- x.groups[sapply(x.groups, length) >= 3]
>> # print out the subsets
>> invisible(lapply(x.subsets, function(a) print(vector[unique(a)])))
> [1] 0.00 0.45
> [1] 3.00 3.25 3.33 3.75 4.10
> [1] 6.00 6.45
> [1] 7.0 7.1
>
>
> On Dec 21, 2007 4:56 AM, Johannes Graumann <johannes_graumann at web.de>
> wrote:
>> <posted & mailed>
>>
>> Dear all,
>>
>> I'm trying to solve the problem of how to find clusters of values in a
>> vector that are closer than a given value. Illustrated, this might look
>> as follows:
>>
>> vector <- c(0,0.45,1,2,3,3.25,3.33,3.75,4.1,5,6,6.45,7,7.1,8)
>>
>> When using '0.5' as the proximity requirement, the following groups would
>> result:
>> 0,0.45
>> 3,3.25,3.33,3.75,4.1
>> 6,6.45
>> 7,7.1
>>
>> Jim Holtman proposed a very elegant solution in
>> http://tolstoy.newcastle.edu.au/R/e2/help/07/07/21286.html, which I have
>> modified and used since he wrote it to me. The beauty of this approach
>> is that it will not only work for constant proximity requirements as
>> above, but also for overlap-windows defined in terms of ppm around each
>> value. Now I have an additional need and have found no way (short of
>> iteratively stepping through all the groups returned) to meet it with
>> Jim's approach: how to figure out that 6,6.45 and 7,7.1 are separate
>> clusters?
>>
>> Thanks for any hints, Joh