[BioC] reducing hits from countGenomicOverlaps()

Valerie Obenchain vobencha at fhcrc.org
Thu Oct 27 19:53:28 CEST 2011


Hi Robert,


On 10/26/2011 03:34 PM, Robert Castelo wrote:
> hi,
>
> On 10/27/11 12:07 AM, Martin Morgan wrote:
>> On 10/25/2011 05:35 PM, Robert Castelo wrote:
>>> dear list,
>>>
>>> the following three lines allow one to count overlaps of aligned
>>> short-reads with annotations:
>>>
>>> aln <- readGappedAlignments("somebamfile.bam")
>>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene")
>>> ensGenes <- exonsBy(txdb, by="gene")
>>> ov <- countGenomicOverlaps(aln, ensGenes)
>>>
>>> then i want to get read-counts per gene and the first thing that comes
>>> to my head is doing:
>>>
>>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]]))
>>>
>>> which goes through every gene and adds up the "hits" of its exons.
>>> however, this latter step of "just adding" takes longer than the actual
>>> calculation of the hits with countGenomicOverlaps() and i guess that
>>> there are more efficient ways to approach this, probably something
>>> around "reducing the hits value column". i've been looking at rdapply()
>>> and reduce() and googled too, but couldn't find anything, so i look
>>> forward to your suggestions.
>>
>> Hi Robert -- A strategy here is along the lines (untested!) of
>>
>> hits <- values(unlist(ov)))[["hits"]]
>> genes <- rep(names(ov), elementLengths(ov))
>> counts <- sapply(split(hits, genes), sum)
>
> beautiful and fast, i've just tested it with the ~ 50,000 Ensembl 
> genes and the execution time is nearly negligible, less than half a 
> second in my laptop.
>
>> but you'll want to make sure that this is a conceptually sensible way of
>> counting hits per gene.
>
> i'm aiming at counting exonic-reads and adding them up at gene level. 
> the function countGenomicOverlaps() allows me to tune the part of the 
> logic related to counting exonic reads, but is there any function that 
> would allow me to tune the summing of exonic reads?

How do you want to tune the summing? If you want to sum all exons by 
gene the piece of code Martin provided should do it. Was there another 
goal you had in mind?

Valerie
>
>> The countGenomicOverlaps function has been heavily updated to
>> summarizeOverlaps in 'devel'; you might want to check that out, or at
>> least the vignette for summarizeOverlaps at
>>
>> http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html
>
> that sounds great, i'll upgrade next week when it becomes release.
>
> thanks again!
> robert.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list