[BioC] multiple hits with countOverlaps function

Wei Shi shi at wehi.EDU.AU
Thu Apr 14 02:46:38 CEST 2011


On Apr 14, 2011, at 10:32 AM, Kasper Daniel Hansen wrote:

> On Wed, Apr 13, 2011 at 7:43 PM, Wei Shi <shi at wehi.edu.au> wrote:
>> Hi,
>> 
>>        I am using countOverlaps function in GenomicFeatures package to count how many sequencing reads are mapped to each genomic feature (e.g. promoter region, gene region etc.). However I found that if a read was mapped to more than one features, then the count of each feature will be increased by 1, i.e. this read was counted more than one times although it can only originate from one genomic location. Below is a simple example to show this:
>> 
>> -------------------------
>> features <- GRanges(seqnames="chr1", ranges=IRanges(c(100,500),c(1000,1500)) ,strand=c("+","+"))
>> features
>> GRanges with 2 ranges and 0 elementMetadata values
>>    seqnames      ranges strand |
>>       <Rle>   <IRanges>  <Rle> |
>> [1]     chr1 [100, 1000]      + |
>> [2]     chr1 [500, 1500]      + |
>> 
>> seqlengths
>>  chr1
>>   NA
>> reads <- GRanges(seqnames="chr1",ranges=IRanges(c(150,600),c(250,700)),strand=c("+","+"))
>> GRanges with 2 ranges and 0 elementMetadata values
>>    seqnames     ranges strand |
>>       <Rle>  <IRanges>  <Rle> |
>> [1]     chr1 [150, 250]      + |
>> [2]     chr1 [600, 700]      + |
>> 
>> seqlengths
>>  chr1
>>   NA
>> countOverlaps(features,reads)
>> [1] 2 1
>> --------------------
>> 
>> I think It will be a better to have an argument in countOverlaps function which specifies whether a single hit (eg. a random one) or multiple hits should be returned for each query sequence.
> 
> I don't see the point at all.  In your case [100,1000] clearly
> overlaps both [150,250] and [600,700].  countOverlaps should return 2.
> If you insist on reporting a random hit, countOverlaps would always
> return 0 or 1 and never a bigger number.  What is the point in this?
> 
> What is true for countOverlaps(query, subject)  is that it is not
> symmetric wrt. query and subject and this is typically essential to
> understand.
> 
> Kasper

The point is that the second read ([600, 700]) overlaps with both features and it was counted by both features. So the first feature ([100, 1000]) counts both reads but the second feature ([500, 1500] ) counts the second read again. Therefore, the second read was counted twice. In other words, there are only two reads in this example, but the total number of counts output from countOverlaps is three. 

Hope this clarifies.

Wei


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}



More information about the Bioconductor mailing list