[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()

Thu Jul 3 19:42:51 CEST 2014

No, the division by 2 isn't always safe. 'ex1.bam' is a well-behaved 
file. Files with more diversity may have more than 2 segments per template.

counts <- lapply(fls, function(fl)
   countBam(fl, param=param)$records)

pairs <- lapply(fls, function(fl)
   length(readGAlignmentPairs(fl, param=param)))

>> unlist(counts)/2
> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 ERR127305
>    363326    392420    397158    349970    355376    381493    389176    355834
>> unlist(pairs)
> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 ERR127305
>    363147    392252    396933    349822    355215    381274    388991    355682

To see what's going on use readGAlignementsList() on the first file. 
This function adds some flexibility in that it reads in fragments, pairs 
w/ no mates, etc.

galist <- readGAlignmentsList(fl, param=param)

The length matches that of countBam() because it reads all records.

>> length(galist)
> [1] 363226

'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'. 
Here we see the 363147 match to output from readGAlignmentsPairs() and 
the rest are ambiguous.

>> table(mcols(galist)$mate_status)
>
>     mated ambiguous   unmated
>    363147        79         0

elementLengths() shows the number of segments per template. Again we see 
the 363147 match and it shows exactly 2 segments for those.

>> table(elementLengths(galist))
>
>      2      4      6      8
> 363147     65      7      7

Valerie

On 07/03/2014 10:08 AM, Fong Chun Chan wrote:
> Thanks for the reply. I understand what you mean about counting aligned
> pairs within a region and how that would give incorrect results.
>
> However, if I was purely interested in just getting all aligned
> read-pairs within the whole genomic space (no restriction to any region
> since this is what I need to normalize by), wouldn't the following code
> work?
>
> library('Rsamtools')
> library('GenomicAlignments')
>
> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
>
> flags <- scanBamFlag(isPaired = TRUE,
>                       isProperPair = TRUE,
>                       isUnmappedQuery = FALSE,
>                       hasUnmappedMate = FALSE)
>
> param <- ScanBamParam(flag = flags)
>
> countBam(fl, param = param)[['records']]/2
> [1]1572
>
> galp <- readGAlignmentPairs(fl, param=param)
> length(galp)
> [1]1572
>
>
> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     Correction below.
>
>
>     On 07/03/2014 09:28 AM, Valerie Obenchain wrote:
>
>         Hi,
>
>         On 07/02/2014 12:56 PM, Fong Chun Chan wrote:
>
>             Hi Valerie,
>
>             Thanks for the reply. I'll post some runtime as soon as it
>             finishes
>             running the readGAlignments and readGAlignmentPairs.
>
>             Thanks for the tip on passing range(allExons) into the which
>             param.
>             There is only one issue in that if I wanted to then generate
>             say RPKM
>             values where I need to know the "total number of aligned
>             reads" in a bam
>             file. Because I've read in just a subset of the bam (i.e.
>             the reads that
>             aligned just to exons) I would have technically produced an
>             incorrection
>             calculation of RPKM values (assuming you normalize using the
>             number of
>             aligned reads). On that note, is there a quick and easy way
>             to just
>             count the number of aligned reads or aligned paired reads so
>             I can
>             quickly normalize by this way. This way I can still use the
>             which
>             parameter in readGAlignments and readGAlignmentPairs to
>             speed it up?
>
>
>         For single-end you can use countBam():
>
>           > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
>           > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery =
>         FALSE))
>           > countBam(fl, param = param)
>             space start end width    file records nucleotides
>         1    NA    NA  NA    NA ex1.bam    3271      115286
>
>         It's not so clear for paired-end. The mate-pairing algorithm
>         used in the
>         readGAlignment* functions is described on the man page and uses a
>         combination of BAM fields and flags. You can't really get at
>         this simply
>         by creating a param with a flag combination such as
>
>         isPaired = TRUE
>         isProperPair = TRUE
>         isUnmappedQuery = FALSE
>         hasUnmappedMate = FALSE
>
>         For example, if you're interested in this region
>
>         gr <- GRanges("seq1", IRanges(100, 500))
>
>         and set flags,
>
>         flags <- scanBamFlag(isPaired = TRUE,
>                                isProperPair = TRUE,
>                                isUnmappedQuery = FALSE,
>                                hasUnmappedMate = FALSE)
>
>         countBam() reports 335 records. These are individual records so
>         the mate
>         pairs (parts) are each counted separately. Only records that
>         fall within
>         the region defined by 'gr' are counted.
>
>           > param <- ScanBamParam(which = gr, flag = flags)
>           > countBam(fl, param = param)
>             space start end width    file records nucleotides
>         1  seq1   100 500   401 ex1.bam     335       11785
>
>
>         readGAlignmentPairs() reports 223 pairs (446 individual
>         records). All
>         records in 'gr' are mated if possible. In the case where one
>         mate part
>         falls inside 'gr' and one outside, the other mate will be retrieved
>         (i.e., outside the range of 'gr'). This makes the 335 above
>         confusing
>         for counting the number of aligned pairs in the region. The
>         number is a
>         mixture of (1) mate parts both in 'gr' that belong together (2) mate
>         parts with mate outside 'gr' and (3) mates or mate parts that
>         don't meet
>         other mate-pairing criteria.
>
>           > galp <- readGAlignmentPairs(fl, param=param)
>           > length(galp)
>         [1] 223
>
>         Off hand I can't think of a good way to count aligned pairs w/in a
>         specified region.
>
>
>     What I should have said here is I can't think of a good way to
>     accurately count aligned pairs (1 count per pair) in a BAM file w/o
>     using the pairing algo.
>
>     Valerie
>
>
>
>             Also another thing I notice is that when I use the
>             readGAlignmentPairs function, I lose my ability to cancel
>             the command
>             (ctrl-c no longer works).
>
>
>         It's implemented in C/C++. You can try ctrl+\ in the session or kill
>         with the top command from another terminal. Maybe others have
>         suggestions they'll add ...
>
>         Valerie
>
>         Not sure this is a bug, or if it is something
>
>             to do with the internal workings of the function. The loss
>             of ctrl-c
>             also occurs when trying to run as a Rscript. I can seemingly
>             ctrl-c
>             anytime before, but it hits this step it just refuses to cancel.
>
>             Thanks,
>
>
>
>
>
>
>             On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain
>             <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>             <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
>                  Hi,
>
>                  You don't need to call findMateAlignment() when using
>                  readGAlignmentPairs(). In a previous iteration,
>                  readGAlignmentPairs() called findMateAlignment() but
>             this is no
>                  longer the case.
>
>                  readGAlignmentPairs() does have more overhead than
>             readGAlignments()
>                  because of the mate paring. These numbers are for a BAM
>             file with
>                  800484 records (400054 final pairs) on my not-so-speedy
>             laptop.
>
>                  library(microbenchmark)
>                  library(RNAseqData.HNRNPC.bam.____chr14)
>                  fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES
>
>                            microbenchmark(____readGAlignments(fls[1]),
>             times=5)
>
>                      Unit: seconds
>                                           expr      min       lq  median
>                    uq
>                        max neval
>                        readGAlignments(fls[1]) 1.753653 1.788292 1.85527
>             2.019071
>                      2.121211     5
>
>
>
>                  ## mate pairing
>
>
>             microbenchmark(____readGAlignmentPairs(fls[1]), times=5)
>
>                      Unit: seconds
>                                               expr      min       lq
>             median
>                      uq      max neval
>                        readGAlignmentPairs(fls[1]) 9.283966 9.300328
>             9.535194
>                      9.843282 11.37441     5
>
>
>                  ## mate pairing
>
>
>             microbenchmark(____readGAlignmentsList(fls[1]), times=5)
>
>                      Unit: seconds
>                                               expr      min       lq
>             median
>                      uq      max neval
>                        readGAlignmentsList(fls[1]) 8.409743 8.457232
>             8.803696
>                      9.692845 9.867628     5
>
>
>
>                  You can set a 'yieldSize' when creating the BamFile.
>             This will chunk
>                  through the file 'yieldSize' records at a time and is
>             useful when
>                  processing a complete file and when there are memory
>             limitations.
>
>                     bf <- BamFile(myfile, yieldSize = 1000000)
>
>                  Herve reported some additional timings for
>             readGAlignmentPairs() on
>                  this post:
>
>             https://stat.ethz.ch/____pipermail/bioconductor/2014-____May/059695.html
>             <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html>
>
>             <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html
>             <https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html>>
>
>
>                  Can you provide some numbers for how long your run is
>             taking and the
>                  number of records you're trying to read in?
>
>                  I assume 'allExons' is used for counting against the
>             BAM files in
>                  later code? Maybe this is why your were looking into
>                  summarizeOverlaps(). I should point out that if you're only
>                  interested in the regions overlapping 'allExons' you
>             could call
>                  range(allExons) and use that as the 'which' in the
>             param. This would
>                  focus the area of interest and speed up the reading.
>
>                  Valerie
>
>
>
>
>                  On 06/30/14 12:22, Fong Chun Chan wrote:
>
>                      Hi,
>
>                      I've been using the GenomicFeatures R package to
>             read in some
>                      RNA-seq
>                      paired-end read data. While the readGAlignments()
>             function reads
>                      in the bam
>                      file within a minute, I've noticed that the
>                      readGAlignmentPairs() function
>                      is extremely slow.
>
>                      This is even after restricting the space to just a
>             single
>                      chromosome along
>                      with setting the flags isDuplicate = FALSE and
>             isNotPrimaryRead
>                      = FALSE
>                      (the latter being suggested in the reference):
>
>                      "An easy way to avoid this degradation is to load
>             only primary
>                      alignments
>                      by setting the isNotPrimaryRead ï¬‚ag to FALSE in
>             ScanBamParam()"
>
>                      Based on my understanding, it seems that the ï¬
>                      ndMateAlignment() function
>
>                      has to be run first. Is there any other way to
>             speed it up?
>                      Perhaps I am
>                      doing something wrong here?
>
>                      Code and sessionInfo() below.
>
>                      Thanks!
>
>                      ----
>                      library("GenomicFeatures")
>                      library('GenomicAlignments')
>                      library('Rsamtools')
>                      si <- seqinfo(BamFile(arguments$____bamFile))
>                      gr <- GRanges(arguments$chr, IRanges(100,
>                      seqlengths(si)[[arguments$chr]____]-100))
>                      allExons <- exons(txdb, columns = c('gene_id',
>             'exon_id',
>                      'exon_name'))
>                      scf <- scanBamFlag( isDuplicate = FALSE,
>             isNotPrimaryRead =
>                      FALSE ) #
>                      remove duplicate reads, remove non-primary reads (i.e.
>                      multi-aligned reads)
>                      reads <- readGAlignmentPairs( arguments$bamFile,
>             param =
>                      ScanBamParam(
>                      which = gr, flag = scf ) ); # grab reads in
>             specific region
>
>                          sessionInfo()
>
>                      R version 3.1.0 (2014-04-10)
>                      Platform: x86_64-unknown-linux-gnu (64-bit)
>
>                      locale:
>                         [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>                      LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>                      LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>                        LC_PAPER=en_US.UTF-8
>                              LC_NAME=C                  LC_ADDRESS=C
>                      LC_TELEPHONE=C
>                                    LC_MEASUREMENT=en_US.UTF-8
>             LC_IDENTIFICATION=C
>
>                      attached base packages:
>                      [1] parallel  stats     graphics  grDevices utils
>                datasets
>                        methods
>                      base
>
>                      other attached packages:
>                         [1] argparse_1.0.1          proto_0.3-10
>                         GenomicAlignments_1.0.1 BSgenome_1.32.0
>             Rsamtools_1.16.1
>                         Biostrings_2.32.0       XVector_0.4.0
>                      GenomicFeatures_1.16.2
>                         AnnotationDbi_1.26.0    Biobase_2.24.0
>                        GenomicRanges_1.16.3
>                         GenomeInfoDb_1.0.2      IRanges_1.22.9
>                        BiocGenerics_0.10.0
>                      vimcom.plus_0.9-93
>                      [16] setwidth_1.0-3          colorout_1.0-2
>
>                      loaded via a namespace (and not attached):
>                         [1] BatchJobs_1.2      BBmisc_1.7
>             BiocParallel_0.6.1
>                      biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6
>                      checkmate_1.0
>                         codetools_0.2-8    DBI_0.2-7          digest_0.6.4
>             fail_1.2
>                          findpython_1.0.1   foreach_1.4.2      getopt_1.20.0
>                        iterators_1.0.7
>                           plyr_1.8.1         Rcpp_0.11.2
>               RCurl_1.95-4.1
>                      [19] rjson_0.2.14       RSQLite_0.11.4
>             rtracklayer_1.24.2
>                      sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2
>             tools_3.1.0
>                         XML_3.98-1.1       zlibbioc_1.10.0
>
>
>
>                               [[alternative HTML version deleted]]
>
>
>
>                      ___________________________________________________
>                      Bioconductor mailing list
>             Bioconductor at r-project.org
>             <mailto:Bioconductor at r-project.org>
>             <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>             https://stat.ethz.ch/mailman/____listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                      Search the archives:
>
>             http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>                  --
>                  Valerie Obenchain
>                  Program in Computational Biology
>                  Fred Hutchinson Cancer Research Center
>                  1100 Fairview Ave. N, Seattle, WA 98109
>
>                  Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>             <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                  Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>             <tel:%28206%29%20667-3158>
>
>
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>