[BioC] find overlap of bed files of different length

Martin Morgan mtmorgan at fhcrc.org
Tue Feb 1 23:23:12 CET 2011


On 02/01/2011 11:08 AM, Michael Lawrence wrote:
> On Tue, Feb 1, 2011 at 10:35 AM, Duke <duke.lists at gmx.com> wrote:
> 
>>  On 2/1/11 1:11 PM, Michael Lawrence wrote:
>>
>>
>>
>> On Tue, Feb 1, 2011 at 7:06 AM, Duke <duke.lists at gmx.com> wrote:
>>
>>> On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote:
>>>
>>>> Use findOverlaps to find all cases.  This is usually the hard and big
>>>> computation.  Then use for example pintersect() to compute the actual
>>>> overlap in percent.  There might be some tedious coding involved.
>>>>
>>>
>>>  Thanks for your suggestion, Kasper, though honestly I have not tried it
>>> yet. Based on what you and Martin suggested, I expected the final code to
>>> be slow, because of having to split by strand/subset and run on each.
>>> My task is also a little more complicated: I need to estimate gene
>>> expression (counting sequences in exonic regions of each gene). I also gave
>>> BEDTools a try, but it does not fulfil my needs (extremely slow for a gene
>>> list of 28k).
>>>
>>> I ended up writing a C++ program to do the job. Thanks for all of your
>>> suggestions and help, guys.
>>>
>>>
>> It would be nice to have a little more detail about what you needed. If
>> findOverlaps and friends aren't doing the job, it would be good to know.
>> Counting reads in exons of genes is as simple as calling countOverlaps on
>> the GRangesList of the exons.
>>
>>
>> Hi Michael,
>>
>> My task is to count reads of different lengths from a BED file in the exons
>> of genes, with a controllable overlap option (by percentage, not by bases).
>> Some people want to count a read only if 100% of its length overlaps, while
>> others might want a threshold of 20%, for example. This option would be very
>> similar to minOverlap, but in percent instead of bases.
>>
>>
> This is a reasonable request. As Kasper mentioned, it's possible with post
> processing.
> 
> E.g.:
> 
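> # all (query, subject) pairs that overlap
> m <- findOverlaps(query, subject)
> # overlap width as a fraction of each query range's width
> percentOverlap <- width(ranges(m, query, subject)) /
>   width(query)[queryHits(m)]
> # cutoff is a fraction in [0, 1]; >= lets a cutoff of 1 keep full overlaps
> keep <- percentOverlap >= cutoff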

There are rough edges, e.g., (G)RangesList/(G)RangesList, but yes, this
will make it into GenomicRanges. Martin
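
For readers following along, here is a minimal, self-contained sketch of the
post-processing idea on plain IRanges. It is only an illustration, not the
eventual GenomicRanges implementation: the toy ranges and the names
fracOverlap and counts are made up for the example, and it uses pintersect()
(which Kasper mentioned) to get the pairwise intersections. The comparison is
>= so that a cutoff of 1 (100%) keeps reads lying entirely within an exon.

```r
library(IRanges)

# Hypothetical toy data: three 10-base reads (query) and two exons (subject).
query   <- IRanges(start = c(1, 22, 95), end = c(10, 31, 104))
subject <- IRanges(start = c(5, 100),    end = c(25, 200))

# Assumed threshold: at least half of the read must lie inside the exon.
cutoff <- 0.5

# All overlapping (read, exon) pairs.
m  <- findOverlaps(query, subject)
qh <- queryHits(m)
sh <- subjectHits(m)

# Width of each pairwise intersection, as a fraction of the read's width.
fracOverlap <- width(pintersect(query[qh], subject[sh])) / width(query)[qh]
keep <- fracOverlap >= cutoff

# Number of retained reads per exon.
counts <- tabulate(sh[keep], nbins = length(subject))
```

Dividing by width(subject)[sh] instead would express the overlap as a
fraction of the exon rather than of the read; which denominator is wanted
depends on the question being asked.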

> 
> Perhaps someone up North could add this to IRanges/GenomicRanges?
> 
> Michael
> 
> D.
>>
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


