[BioC] multiple feature/mode counting with summarizeOverlaps

Tue May 21 07:12:32 CEST 2013

I apologize up front to Valarie for posting additional questions related
to her excellent read counter. This time my questions are related to the
economics of accessing big files as few times as possible to improve
workflow performance in large RNA-Seq projects and to subsequently allow
deleting all input bam files since BIG DATA is killing us :). 

(1) What is currently the intended approach to obtain counts for
multiple feature types in a single pass-through of one or many bam
files?  For instance, often we want to count the reads overlapping with
many different features types (not just one), such as exonBygenes,
cdsBygenes, UTRs, introns and intergenics. Computing all those feature
counts sequentially in a loop is easy, but can be time consuming and
heavy on disk I/O (the most expensive resource in NGS analysis) and
internal networks when running many counting processes in parallel. As a
partial solution to this, I often organize all feature types in one big
GRanges/GRangesList container, perform the counting and then split the
resulting count tables accordingly. However, is this really the best way
of doing this?  A major limitation of this approach is that it does not
support feature overlap aware counting modes. 

(2) Similarly, what if I want to generate counts for multiple counting
modes in a single call to summarizeOverlaps, such as different overlap
modes, or sense and antisense counts which we often want to report when
working with strand specific RNA-Seq data? Are there plans to support
this? BTW: currently, I generate antisense read counts by inversing the
strand information of the features which works fine just not for both
ways in one step. 

(3) Is it currently possible to return the total numbers of aligned
reads in bam files with summarizeOverlaps and store them in the
resulting counting object? There are many places from where the numbers
of aligned reads can be obtained, but to me the most obvious step to
generate and store them would be right here. They are useful parameters
for alignment QC'ing and reporting proper RPKM/FPKM values to biologists.

Thomas