[BioC] multiple feature/mode counting with summarizeOverlaps

Fri May 24 00:39:11 CEST 2013

Thanks Varlerie for looking into these suggestions and improvements. I
really appreciate the time and effort you keep investing into my
unsolicited questions. - Obviously, read counting has become a major
activity for many of us and your software will definitley put to great
use. In our projects, flexiblity and accuracy of the read counter is
most important to us, which summarizeOverlaps and associates address
very well. Since research facilities like ours have to perform those
analyses on very large disk storage devices with several hundred TBs of
space that are shared among many NGS users, disk I/O and network speeds
on compute clusters have become a major bottleneck. Thus, we are also
very interested in solutions that collect counts for many feature types
and counting modes in a single pass-through of these big files, as
opposed to accessing them over and over again when it is not really
necessary. Usually, I also like to delete the bam files early on in most
of our RNA-Seq workflows to save storage space. However, our
collaborators often keep coming up with all kinds of creative and
exciting new ideas what features/modes to count on where the best
analysis strategy is often to generate by default counts for many
feature types right away (even if I don't intend and recommend or to use
them) rather than only one prescibed solution that usually ends up not
being enough. I guess RNA-Seq is still very much in an engineering stage
where most of us depend on software design that provides a very high
level of flexiblitly.

Again, thanks a lot.

Thomas

On Thu, May 23, 2013 at 09:06:10PM +0000, Valerie Obenchain wrote:
> Hi Thomas,
> 
> On 05/20/2013 10:12 PM, Thomas Girke wrote:
> > I apologize up front to Valarie for posting additional questions related
> > to her excellent read counter. This time my questions are related to the
> > economics of accessing big files as few times as possible to improve
> > workflow performance in large RNA-Seq projects and to subsequently allow
> > deleting all input bam files since BIG DATA is killing us :).
> >
> > (1) What is currently the intended approach to obtain counts for
> > multiple feature types in a single pass-through of one or many bam
> > files?  For instance, often we want to count the reads overlapping with
> > many different features types (not just one), such as exonBygenes,
> > cdsBygenes, UTRs, introns and intergenics. Computing all those feature
> > counts sequentially in a loop is easy, but can be time consuming and
> > heavy on disk I/O (the most expensive resource in NGS analysis) and
> > internal networks when running many counting processes in parallel. As a
> > partial solution to this, I often organize all feature types in one big
> > GRanges/GRangesList container, perform the counting and then split the
> > resulting count tables accordingly. However, is this really the best way
> > of doing this?  A major limitation of this approach is that it does not
> > support feature overlap aware counting modes.
> 
> I would not recommend combining different annotations into a single
> GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or 
> each list element of a GRangesList represents a feature. If a read hits 
> more than one feature we have a double hit situation that must be 
> resolved. If you're interested in counting both exonsByGene and 
> transcriptsByGene you probably want these to be independent counts. By 
> combining them into one object they can be confounding and create 
> multiple hits situations where there really aren't any.
> 
> Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from 
> 3UTRsByTranscript().
> 
> gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10))
> gr2 <- GRanges("chr1", IRanges(25, width=10))
> grl <- GRangesList(gr1, gr2)
> reads <- GAlignments("chr1", pos=18L, cigar="10M", strand=strand("*"))
> 
> 
> The read hits gr1 and gr2. If counted separately,
> 
>  > assays(summarizeOverlaps(grl[1], reads, Union))$counts
>       reads
> [1,]     1
>  > assays(summarizeOverlaps(grl[2], reads, Union))$counts
>       reads
> [1,]     1
> 
> Now count as GRangesList,
> 
>  > assays(summarizeOverlaps(grl, reads, Union))$counts
>       reads
> [1,]     0
> [2,]     0
> 
> 
> Using 'inter.feature=FALSE' will regain the hits with modes Union and 
> IntersectionStrict but not necessarily IntersectionNotEmpty because 
> shared regions of the annotation are removed before counting.
> 
>  > assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts
>       reads
> [1,]     1
> [2,]     1
> 
> >
> > (2) Similarly, what if I want to generate counts for multiple counting
> > modes in a single call to summarizeOverlaps, such as different overlap
> > modes, or sense and antisense counts which we often want to report when
> > working with strand specific RNA-Seq data? Are there plans to support
> > this? BTW: currently, I generate antisense read counts by inversing the
> > strand information of the features which works fine just not for both
> > ways in one step.
> 
> We don't have a built-in method to count Bams over multiple annotations 
> and multiple count modes. There are several approaches you could take. 
> After some discussion this was one we came up with.
> 
>      ## Count a single file.
>      count1bam <-
>          function(file, features, mode, ...,
>                   yieldSize=1000000L, reader=readGAlignments,
>                   param=ScanBamParam())
>      {
>          ## initialize
>          bf <- open(BamFile(file, yieldSize=yieldSize))
>          counts <- lapply(sapply(features, length), integer)
> 
>          ## iterate
>          while(length(reads <- reader(bf, param=param)))
>              for (i in seq_along(features))
>                  counts[[i]] <-
>                      counts[[i]] + mode(features[[i]], reads, ...)
>          close(bf)
>          counts
>      }
> 
>      ## Count over files in parallel.
>      countbams <-
>          function(features, files, ...)
>      {
>          counts0 <- mclapply(files, count1bam, features=features, ...)
>          counts <- lapply(names(features), function(i, counts) {
>              sapply(counts0, "[[", i)
>          }, counts0)
>      }
> 
>      ## In practice.
>      library(parallel); options(mc.cores=detectCores())
>      library(GenomicRanges)
>      library(Rsamtools)
>      features <-
>          list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))),
>               fine=GRanges("seq2", successiveIRanges(rep(20, 45))))
>      fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>      files <- list(fl, fl)  ## usually different files
> 
>      ## As 'mode' you can directly call Union, IntersectionStrict,
>      ## and IntersectionNotEmpty.
>      result <- countbams(features, files, mode=Union,
>                          ignore.strand=TRUE, inter.feature=FALSE)
> 
>      ## 'reader' specifies how the data are read in. For paired-end
>      ## use readGAlignmentsList (qname sorted or readGAlignmentPairs.
>      result <- countbams(features, files, mode=Union,
>                          reader=readGAlignmentPairs)
> 
> This code handles the case of multiple Bam with multiple annotations. To 
> handle combinations for multiple arguments ('mode', 'ignore.strand', 
> 'inter.feature', etc.) you could add a Map() in the 'while' loop in 
> count1bam. Instead of passing a single value you would pass lists the 
> same length as the list of annotations. For example,
> 
> modeList <- c(Union, IntersectionStrict)
> ignoreStrandList <- c(TRUE, TRUE)
> interFeatureList <- c(TRUE, FALSE)
> 
> >
> > (3) Is it currently possible to return the total numbers of aligned
> > reads in bam files with summarizeOverlaps and store them in the
> > resulting counting object? There are many places from where the numbers
> > of aligned reads can be obtained, but to me the most obvious step to
> > generate and store them would be right here. They are useful parameters
> > for alignment QC'ing and reporting proper RPKM/FPKM values to biologists.
> >
> 
> I've added the output of countBam() to the 'colData' slot of the 
> SummarizedExperiment. The 'mapped' counts are computed with countBam() 
> and a param where 'isUnmappedQuery=FALSE'. This isn't a great example 
> but does demonstrate the new output.
> 
> library(Rsamtools)
> library(pasillaBamSubset)
> library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
> exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4()))
> 
>  > colData(se)
> DataFrame with 1 row and 3 columns
>                        records nucleotides    mapped
>                      <integer>   <numeric> <integer>
> untreated1_chr4.bam    204355    15326625    204355
> 
> 
> 
> Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16.
> 
> Valerie
> 
> > Thomas
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >