[BioC] multiple hits with countOverlaps function

Wei Shi shi at wehi.EDU.AU
Thu Apr 14 04:07:31 CEST 2011


Hi:

	No one knows which features is the right one. But my view is that assigning a read to a random feature is better than assigning it to all the overlapping ones. Your code can certainly find a random feature for the read, but adding an option to the function might be better.
	
On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote:

> Hi,
> 
> On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi <shi at wehi.edu.au> wrote:
> [snip]
>>> I dont get the "the second read was counted twice. "  It is the nature
>>> of the problem that reads have length > 1 and they can overlap
>>> multiple features and you need to thing about how you want to deal
>>> with this.  I assume you are looking at HTseq data, and I cannot
>>> really understand what you are trying to do.
>>> 
>>> Kasper
>> 
>> Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts.
> 
> Yes, it can only originate from one, but how would countOverlaps know which one?
> 
> How does anyone know which one?
> 
> As far as I know, there is no standard way to deal with this, so the
> burden is on you.
> 
> If you have an idea of "what's right" in this situation, why don't you
> rather ask the opposite question first, which is: "Which reads overlap
> more than one feature?"
> 
> R> features.per.read <- countOverlaps(reads, features)
> 
> Then do your "naive"  counting/quantifying w/ just the "easy" case:
> 
> R> reads.per.feature <- countOverlaps(features,
> features.per.read[features.per.read == 1])
> 
> Now you can figure out what you want to do for features.per.read > 1,
> because picking one feature to assign the read to at random by default
> is clearly wrong, right?
> 
> What if the gene/isoform that the read overlaps with isn't expressed
> at all, you'd be assigning reads to it "just because" in your
> scenario.
> 
> -- 
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}



More information about the Bioconductor mailing list