[BioC] multiple hits with countOverlaps function

Wei Shi shi at wehi.EDU.AU
Thu Apr 14 04:33:13 CEST 2011


Also I want to mention that choosing a random feature will eventually make all overlapping features have the the same number of such reads too if the total number of such reads is large enough. This should give a less biased read count summarization compared to assigning them to all features (which have the same numbers of such reads too) because total number of reads assigned to features will be the same with the original total when only random features are chosen.

On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote:

> Hi,
> 
> On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi <shi at wehi.edu.au> wrote:
> [snip]
>>> I dont get the "the second read was counted twice. "  It is the nature
>>> of the problem that reads have length > 1 and they can overlap
>>> multiple features and you need to thing about how you want to deal
>>> with this.  I assume you are looking at HTseq data, and I cannot
>>> really understand what you are trying to do.
>>> 
>>> Kasper
>> 
>> Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts.
> 
> Yes, it can only originate from one, but how would countOverlaps know which one?
> 
> How does anyone know which one?
> 
> As far as I know, there is no standard way to deal with this, so the
> burden is on you.
> 
> If you have an idea of "what's right" in this situation, why don't you
> rather ask the opposite question first, which is: "Which reads overlap
> more than one feature?"
> 
> R> features.per.read <- countOverlaps(reads, features)
> 
> Then do your "naive"  counting/quantifying w/ just the "easy" case:
> 
> R> reads.per.feature <- countOverlaps(features,
> features.per.read[features.per.read == 1])
> 
> Now you can figure out what you want to do for features.per.read > 1,
> because picking one feature to assign the read to at random by default
> is clearly wrong, right?
> 
> What if the gene/isoform that the read overlaps with isn't expressed
> at all, you'd be assigning reads to it "just because" in your
> scenario.
> 
> -- 
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}



More information about the Bioconductor mailing list