[BioC] reducing hits from countGenomicOverlaps()

Thu Oct 27 23:28:29 CEST 2011

On 10/27/2011 12:44 PM, Robert Castelo wrote:
> Hi Valerie,
>
> On 10/27/11 7:53 PM, Valerie Obenchain wrote:
>> Hi Robert,
>>
>>
>> On 10/26/2011 03:34 PM, Robert Castelo wrote:
>>> hi,
>>>
>>> On 10/27/11 12:07 AM, Martin Morgan wrote:
>>>> On 10/25/2011 05:35 PM, Robert Castelo wrote:
>>>>> dear list,
>>>>>
>>>>> the following three lines allow one to count overlaps of aligned
>>>>> short-reads with annotations:
>>>>>
>>>>> aln <- readGappedAlignments("somebamfile.bam")
>>>>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene")
>>>>> ensGenes <- exonsBy(txdb, by="gene")
>>>>> ov <- countGenomicOverlaps(aln, ensGenes)
>>>>>
>>>>> then i want to get read-counts per gene and the first thing that 
>>>>> comes
>>>>> to my head is doing:
>>>>>
>>>>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]]))
>>>>>
>>>>> which goes through every gene and adds up the "hits" of its exons.
>>>>> however, this latter step of "just adding" takes longer than the 
>>>>> actual
>>>>> calculation of the hits with countGenomicOverlaps() and i guess that
>>>>> there are more efficient ways to approach this, probably something
>>>>> around "reducing the hits value column". i've been looking at 
>>>>> rdapply()
>>>>> and reduce() and googled too, but couldn't find anything, so i look
>>>>> forward to your suggestions.
>>>>
>>>> Hi Robert -- A strategy here is along the lines (untested!) of
>>>>
>>>> hits <- values(unlist(ov)))[["hits"]]
>>>> genes <- rep(names(ov), elementLengths(ov))
>>>> counts <- sapply(split(hits, genes), sum)
>>>
>>> beautiful and fast, i've just tested it with the ~ 50,000 Ensembl
>>> genes and the execution time is nearly negligible, less than half a
>>> second in my laptop.
>>>
>>>> but you'll want to make sure that this is a conceptually sensible 
>>>> way of
>>>> counting hits per gene.
>>>
>>> i'm aiming at counting exonic-reads and adding them up at gene level.
>>> the function countGenomicOverlaps() allows me to tune the part of the
>>> logic related to counting exonic reads, but is there any function that
>>> would allow me to tune the summing of exonic reads?
>>
>> How do you want to tune the summing? If you want to sum all exons by
>> gene the piece of code Martin provided should do it. Was there another
>> goal you had in mind?
>
> well, Martin was warning me about whether just adding up exon counts 
> per gene makes complete sense and i guess the problem might arise with 
> overlapping exons of different transcripts since then by just summing 
> i'd be adding some counts twice. so i was wondering whether something 
> existed to address this summing along the lines of what is available 
> about the decision logic options for counting reads overlapping 
> annotations with countGenomicOverlaps().

Both the counting and summarizing problems have been addressed in the 
summarizedOverlaps function Martin mentioned. The function has different 
"modes"  that determine how to assign a read that overlaps multiple 
features. This function does not double count so a read is either 
assigned according to the rules of the "mode" or it is discarded. In the 
end you have counts per feature. A feature can be represented by a 
single range (ie, single exon, transcript or gene) or by multiple ranges 
(ie, exons in a gene, transcripts in a gene, etc.).  I'd be interested 
in feedback once you've had a chance to use the function.

Valerie

>
> cheers,
> robert.