[BioC] find overlapping regions

Martin Morgan mtmorgan at fhcrc.org
Tue May 20 15:35:02 CEST 2008


Hi Marten --

<M.Boetzer at lumc.nl> writes:

> Dear list,
>
> I have a single region with a start and an end, where start < end. I want to find regions that have an overlap of more than 50% with that region. The regions to compare with are in a data frame with start and end positions:
>
> start = 133375983
> end = 146245512
>
> data = data.frame(c(133470532, 133966699, 134162735, 134236863, 146225580), c(133754071, 133969713, 134163857, 134249655,156245512))
> colnames(data) = c("start2", "end2")
>
>> data
>      start2      end2
> 1 133470532 133754071
> 2 133966699 133969713
> 3 134162735 134163857
> 4 134236863 134249655
> 5 146225580 156245512
>
> I've already written some code that does the trick; however, when reg1 becomes very large, it really slows down:
>
>
> regfound = c()
> reg1 = seq(start, end, 1)
> for(i in 1:nrow(data)){
>   eq_reg = sum(is.element(seq(data$start2[i], data$end2[i], 1), reg1)==T)
>   if(eq_reg!=0)
>     regfound = c(regfound, round(eq_reg/((data$end2[i]-data$start2[i])+1)*100,1))
>   else
>     regfound = c(regfound,F)
> }
>
>> regfound
> [1] 100.0 100.0 100.0 100.0   0.2

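Probably the key is to simplify how the overlapping region is found,
and then to vectorize the calculation.  The overlap of two regions
runs from the larger of the two starts to the smaller of the two
ends, so there is no need to enumerate every position.  Maybe
something along the lines of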

> width <- data$end2 - data$start2
> olap <- (pmin(end, data$end2) - pmax(start, data$start2)) / width
> olap > .5
[1]  TRUE  TRUE  TRUE  TRUE FALSE

?
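
A self-contained sketch of the same idea, assuming positions are
inclusive (so a region's width is end2 - start2 + 1, as in the loop
in the question) and that regions with no overlap should score 0
rather than a negative value.  The names olap_len and pct are made up
for illustration:

```r
# Query region and candidate regions, copied from the question
start <- 133375983
end   <- 146245512
data  <- data.frame(
    start2 = c(133470532, 133966699, 134162735, 134236863, 146225580),
    end2   = c(133754071, 133969713, 134163857, 134249655, 156245512))

# Inclusive width of each candidate region (assumption: both end
# points count as positions, matching the "+1" in the original loop)
width <- data$end2 - data$start2 + 1

# Overlap length: smaller end minus larger start, plus 1 for
# inclusive positions; pmax(0, ...) clamps disjoint regions to 0
olap_len <- pmax(0, pmin(end, data$end2) - pmax(start, data$start2) + 1)

# Percentage of each candidate region covered by the query region
pct <- round(olap_len / width * 100, 1)
pct
# [1] 100.0 100.0 100.0 100.0   0.2

# Regions with more than 50% overlap
pct > 50
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

Everything here is a vector operation on whole columns, so no
per-position sequences are built and the cost grows with the number
of rows rather than with the size of the regions.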

Martin

>
> Does anyone know a faster or more elegant way of doing this?
>
> Thanks in advance,
> Marten

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


