[BioC] multiple hits with countOverlaps function

Thu Apr 14 03:20:31 CEST 2011

On Apr 14, 2011, at 11:00 AM, Kasper Daniel Hansen wrote:

> On Wed, Apr 13, 2011 at 8:46 PM, Wei Shi <shi at wehi.edu.au> wrote:
>> The point is that the second read ([600, 700]) overlaps with both features and it was counted by both features. So the first feature ([100, 1000]) counts both reads but the second feature ([500, 1500] ) counts the second read again. Therefore, the second read was counted twice. In other words, there are only two reads in this example, but the total number of counts output from countOverlaps is three.
> 
> Yes, and I think this is entirely to be expected.  In all my
> use-cases, this is exactly what I want.
> 
> I don't get the "the second read was counted twice."  It is the nature
> of the problem that reads have length > 1 and they can overlap
> multiple features and you need to think about how you want to deal
> with this.  I assume you are looking at HTseq data, and I cannot
> really understand what you are trying to do.
> 
> Kasper

Take RNA-seq data as an example. Many genes are known to overlap one another in the genome. If a read maps to a location shared by two genes (more precisely, by one exon from each gene), it should not be assigned to both genes, because it can only have originated from one of the two genes/transcripts.

This may not be a problem for other types of sequencing data, such as histone ChIP-seq, because the marks could affect all of the overlapping features. Still, it would be useful if countOverlaps had an option that lets users decide how to deal with reads that overlap more than one feature.
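
To make the counting concrete, here is a minimal sketch in plain Python (not R, and not how countOverlaps is implemented). It uses the two features from this thread, [100, 1000] and [500, 1500], and the second read, [600, 700]. The first read's coordinates, the function names, and the "multi" argument are all made up for illustration; the "drop" behaviour is just one possible policy for reads that hit several features.

```python
# Illustrative sketch only; this is NOT the countOverlaps implementation.
# Intervals are closed [start, end]. The first read's coordinates are
# hypothetical; the thread only gives the second read, [600, 700].

def overlaps(a, b):
    """Return True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def count_reads(features, reads, multi="all"):
    """Count reads per feature.

    multi="all":  a read is counted once for every feature it overlaps.
    multi="drop": a read overlapping more than one feature is discarded.
    (The "multi" argument is a hypothetical option, not an existing one.)
    """
    counts = [0] * len(features)
    for read in reads:
        hits = [i for i, f in enumerate(features) if overlaps(f, read)]
        if multi == "drop" and len(hits) > 1:
            continue  # ambiguous read: assign it to no feature
        for i in hits:
            counts[i] += 1
    return counts

features = [(100, 1000), (500, 1500)]
reads = [(200, 300), (600, 700)]  # first read is a made-up example

counts_all = count_reads(features, reads, multi="all")
counts_drop = count_reads(features, reads, multi="drop")
print(counts_all, sum(counts_all))    # [2, 1] 3
print(counts_drop, sum(counts_drop))  # [1, 0] 1
```

With multi="all" the two reads give a total of three counts, which is the behaviour described above. With multi="drop" the ambiguous read is not assigned to either gene, so the total never exceeds the number of reads.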

Wei
