[BioC] multiple hits with countOverlaps function

Mete Civelek mcivelek at mednet.ucla.edu
Thu Apr 14 02:56:54 CEST 2011


Is it possible to implement an option that will do weighted counts, i.e. if
a given sequence maps to two locations then each is counted as 0.5,
similarly if it maps to three locations, each mapping is counted as 1/3?

-----Original Message-----
From: bioconductor-bounces at r-project.org
[mailto:bioconductor-bounces at r-project.org] On Behalf Of Wei Shi
Sent: Wednesday, April 13, 2011 5:47 PM
To: Kasper Daniel Hansen
Cc: Bioconductor
Subject: Re: [BioC] multiple hits with countOverlaps function


On Apr 14, 2011, at 10:32 AM, Kasper Daniel Hansen wrote:

> On Wed, Apr 13, 2011 at 7:43 PM, Wei Shi <shi at wehi.edu.au> wrote:
>> Hi,
>>
>>        I am using countOverlaps function in GenomicFeatures package to
count how many sequencing reads are mapped to each genomic feature (e.g.
promoter region, gene region etc.). However I found that if a read was
mapped to more than one features, then the count of each feature will be
increased by 1, i.e. this read was counted more than one times although it
can only originate from one genomic location. Below is a simple example to
show this:
>>
>> -------------------------
>> features <- GRanges(seqnames="chr1",
ranges=IRanges(c(100,500),c(1000,1500)) ,strand=c("+","+"))
>> features
>> GRanges with 2 ranges and 0 elementMetadata values
>>    seqnames      ranges strand |
>>       <Rle>   <IRanges>  <Rle> |
>> [1]     chr1 [100, 1000]      + |
>> [2]     chr1 [500, 1500]      + |
>>
>> seqlengths
>>  chr1
>>   NA
>> reads <-
GRanges(seqnames="chr1",ranges=IRanges(c(150,600),c(250,700)),strand=c("+","
+"))
>> GRanges with 2 ranges and 0 elementMetadata values
>>    seqnames     ranges strand |
>>       <Rle>  <IRanges>  <Rle> |
>> [1]     chr1 [150, 250]      + |
>> [2]     chr1 [600, 700]      + |
>>
>> seqlengths
>>  chr1
>>   NA
>> countOverlaps(features,reads)
>> [1] 2 1
>> --------------------
>>
>> I think It will be a better to have an argument in countOverlaps function
which specifies whether a single hit (eg. a random one) or multiple hits
should be returned for each query sequence.
>
> I don't see the point at all.  In your case [100,1000] clearly
> overlaps both [150,250] and [600,700].  countOverlaps should return 2.
> If you insist on reporting a random hit, countOverlaps would always
> return 0 or 1 and never a bigger number.  What is the point in this?
>
> What is true for countOverlaps(query, subject)  is that it is not
> symmetric wrt. query and subject and this is typically essential to
> understand.
>
> Kasper

The point is that the second read ([600, 700]) overlaps with both features
and it was counted by both features. So the first feature ([100, 1000])
counts both reads but the second feature ([500, 1500] ) counts the second
read again. Therefore, the second read was counted twice. In other words,
there are only two reads in this example, but the total number of counts
output from countOverlaps is three.

Hope this clarifies.

Wei


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:19}}



More information about the Bioconductor mailing list