[BioC] multiple hits with countOverlaps function

Thu Apr 14 04:17:55 CEST 2011

On Apr 14, 2011, at 11:34 AM, Kasper Daniel Hansen wrote:

> On Wed, Apr 13, 2011 at 9:19 PM, Wei Shi <shi at wehi.edu.au> wrote:
>> Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts.
> 
> Why?  I mean, I see what you mean from an assay point of view.  But
> clearly it is highly non-trivial to decide which of the two genes you
> want to assign the read to.  People have proposed methods for this but
> it is certainly not resolved.  I would argue that the first step is to
> divide the 2 genes into 3 genomic regions, gene1 only, gene2 only and
> both gene1 and gene2, count the overlap and then proceed from there.
> 
> Doing a (uniform) random assignment is clearly wrong (for this
> particular use case).
> 
> Finally, this cannot be the situation your example is referring to,
> since you did not have overlapping features.  You were instead
> implying that if the first read hit feature A, the second read should
> not be allowed to hit feature A.  (but that is perhaps not too
> important).
> 
> I might be argumentative, but you seem to imply that this
> "unresolveness" should be handled by a very simple option to
> countOverlaps and I would argue that this is not a solution.  Indeed I
> would argue that users who do not wish to think hard about this are
> ignoring a possibly important (and possibly unimportant)  issue with
> the data.
> 
> Kasper

Please bring your posts to the mailing list, Kasper.

The two features in my email are overlapping. I re-paste the features below. 

>> features <- GRanges(seqnames="chr1", ranges=IRanges(c(100,500),c(1000,1500)) ,strand=c("+","+"))
>> features
>> GRanges with 2 ranges and 0 elementMetadata values
>>   seqnames      ranges strand |
>>      <Rle>   <IRanges>  <Rle> |
>> [1]     chr1 [100, 1000]      + |
>> [2]     chr1 [500, 1500]      + |
> 

I did not make any implication on which read should hit which feature. The purpose of that example was just to show that the read which hits multiple features will be counted by each of them.

Also, I did not imply choosing a random feature will solve the multi-feature hit problem and I do not think it will solve this problem. However, I think assigning the read to a random feature is better than assigning it to all of them. 

Wei

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}