[BioC] find overlapping regions

Tue May 20 17:45:19 CEST 2008

Hi Marten,
You may want to have a look at the function regionOverlap in the package 
Ringo. It is not elegant, but it is probably faster, since it uses (simple) 
C code to compute the overlap.
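
If you prefer to stay in plain R, the overlap can also be computed directly from the start and end coordinates instead of enumerating every position. What follows is only a minimal sketch, independent of regionOverlap. It assumes inclusive coordinates and takes the percentage relative to each candidate region's own length, as your loop does. The names ov_len, pct and hits are made up for illustration.

```r
# Query region and candidate regions, copied from the original post
start <- 133375983
end   <- 146245512
data  <- data.frame(
  start2 = c(133470532, 133966699, 134162735, 134236863, 146225580),
  end2   = c(133754071, 133969713, 134163857, 134249655, 156245512)
)

# Overlap length in positions (inclusive coordinates);
# pmax(0, ...) turns negative values from disjoint regions into 0
ov_len <- pmax(0, pmin(end, data$end2) - pmax(start, data$start2) + 1)

# Overlap as a percentage of each candidate region's own length
pct <- round(ov_len / (data$end2 - data$start2 + 1) * 100, 1)

# Candidate regions covered by more than 50% (hypothetical variable name)
hits <- data[pct > 50, ]
```

For the example data, pct comes out as 100, 100, 100, 100 and 0.2, the same values your loop reports, and hits contains rows 1 to 4. If the 50% should instead refer to the length of the single query region, divide ov_len by (end - start + 1).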
Regards,
Joern

M.Boetzer at lumc.nl wrote:
> Dear list,
>
> I have a single region with a start and an end, where start < end. I want to find regions that have an overlap of more than 50% with that region. The regions to compare against are stored in a data frame with start and end positions:
>
> start = 133375983
> end = 146245512
>
> data = data.frame(c(133470532, 133966699, 134162735, 134236863, 146225580), c(133754071, 133969713, 134163857, 134249655, 156245512))
> colnames(data) = c("start2", "end2")
>
> > data
>      start2      end2
> 1 133470532 133754071
> 2 133966699 133969713
> 3 134162735 134163857
> 4 134236863 134249655
> 5 146225580 156245512
>
> I have already written some code that does the trick. However, when reg1 becomes very large, it slows down considerably:
>
>
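> regfound = c()
> reg1 = seq(start, end, 1)  # every position in the query region
> for (i in 1:nrow(data)) {
>   # number of positions of region i that also lie in reg1
>   eq_reg = sum(is.element(seq(data$start2[i], data$end2[i], 1), reg1))
>   if (eq_reg != 0)
>     # overlap as a percentage of the length of region i
>     regfound = c(regfound, round(eq_reg / ((data$end2[i] - data$start2[i]) + 1) * 100, 1))
>   else
>     regfound = c(regfound, FALSE)
> }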
>
> > regfound
> [1] 100.0 100.0 100.0 100.0   0.2
>
> Does anyone know a faster or more elegant way of doing this?
>
> Thanks in advance,
> Marten