[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()

Thu Jul 3 18:28:10 CEST 2014

Hi,

On 07/02/2014 12:56 PM, Fong Chun Chan wrote:
> Hi Valerie,
>
> Thanks for the reply. I'll post some runtime as soon as it finishes
> running the readGAlignments and readGAlignmentPairs.
>
> Thanks for the tip on passing range(allExons) into the which param.
> There is only one issue in that if I wanted to then generate say RPKM
> values where I need to know the "total number of aligned reads" in a bam
> file. Because I've read in just a subset of the bam (i.e. the reads that
> aligned just to exons) I would have technically produced an incorrection
> calculation of RPKM values (assuming you normalize using the number of
> aligned reads). On that note, is there a quick and easy way to just
> count the number of aligned reads or aligned paired reads so I can
> quickly normalize by this way. This way I can still use the which
> parameter in readGAlignments and readGAlignmentPairs to speed it up?

For single-end you can use countBam():

 > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
 > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE))
 > countBam(fl, param = param)
   space start end width    file records nucleotides
1    NA    NA  NA    NA ex1.bam    3271      115286

It's not so clear for paired-end. The mate-pairing algorithm used in the 
readGAlignment* functions is described on the man page and uses a 
combination of BAM fields and flags. You can't really get at this simply 
by creating a param with a flag combination such as

isPaired = TRUE
isProperPair = TRUE
isUnmappedQuery = FALSE
hasUnmappedMate = FALSE

For example, if you're interested in this region

gr <- GRanges("seq1", IRanges(100, 500))

and set flags,

flags <- scanBamFlag(isPaired = TRUE,
                      isProperPair = TRUE,
                      isUnmappedQuery = FALSE,
                      hasUnmappedMate = FALSE)

countBam() reports 335 records. These are individual records so the mate 
pairs (parts) are each counted separately. Only records that fall within 
the region defined by 'gr' are counted.

 > param <- ScanBamParam(which = gr, flag = flags)
 > countBam(fl, param = param)
   space start end width    file records nucleotides
1  seq1   100 500   401 ex1.bam     335       11785

readGAlignmentPairs() reports 223 pairs (446 individual records). All 
records in 'gr' are mated if possible. In the case where one mate part 
falls inside 'gr' and one outside, the other mate will be retrieved 
(i.e., outside the range of 'gr'). This makes the 335 above confusing 
for counting the number of aligned pairs in the region. The number is a 
mixture of (1) mate parts both in 'gr' that belong together (2) mate 
parts with mate outside 'gr' and (3) mates or mate parts that don't meet 
other mate-pairing criteria.

 > galp <- readGAlignmentPairs(fl, param=param)
 > length(galp)
[1] 223

Off hand I can't think of a good way to count aligned pairs w/in a 
specified region.

>
> Also another thing I notice is that when I use the
> readGAlignmentPairs function, I lose my ability to cancel the command
> (ctrl-c no longer works).

It's implemented in C/C++. You can try ctrl+\ in the session or kill 
with the top command from another terminal. Maybe others have 
suggestions they'll add ...

Valerie

Not sure this is a bug, or if it is something
> to do with the internal workings of the function. The loss of ctrl-c
> also occurs when trying to run as a Rscript. I can seemingly ctrl-c
> anytime before, but it hits this step it just refuses to cancel.
>
> Thanks,
>
>
>
>
>
>
> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     Hi,
>
>     You don't need to call findMateAlignment() when using
>     readGAlignmentPairs(). In a previous iteration,
>     readGAlignmentPairs() called findMateAlignment() but this is no
>     longer the case.
>
>     readGAlignmentPairs() does have more overhead than readGAlignments()
>     because of the mate paring. These numbers are for a BAM file with
>     800484 records (400054 final pairs) on my not-so-speedy laptop.
>
>     library(microbenchmark)
>     library(RNAseqData.HNRNPC.bam.__chr14)
>     fls <- RNAseqData.HNRNPC.bam.chr14___BAMFILES
>
>               microbenchmark(__readGAlignments(fls[1]), times=5)
>
>         Unit: seconds
>                              expr      min       lq  median       uq
>           max neval
>           readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071
>         2.121211     5
>
>
>
>     ## mate pairing
>
>               microbenchmark(__readGAlignmentPairs(fls[1]), times=5)
>
>         Unit: seconds
>                                  expr      min       lq   median
>         uq      max neval
>           readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194
>         9.843282 11.37441     5
>
>
>     ## mate pairing
>
>               microbenchmark(__readGAlignmentsList(fls[1]), times=5)
>
>         Unit: seconds
>                                  expr      min       lq   median
>         uq      max neval
>           readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696
>         9.692845 9.867628     5
>
>
>
>     You can set a 'yieldSize' when creating the BamFile. This will chunk
>     through the file 'yieldSize' records at a time and is useful when
>     processing a complete file and when there are memory limitations.
>
>        bf <- BamFile(myfile, yieldSize = 1000000)
>
>     Herve reported some additional timings for readGAlignmentPairs() on
>     this post:
>
>     https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html
>     <https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html>
>
>
>     Can you provide some numbers for how long your run is taking and the
>     number of records you're trying to read in?
>
>     I assume 'allExons' is used for counting against the BAM files in
>     later code? Maybe this is why your were looking into
>     summarizeOverlaps(). I should point out that if you're only
>     interested in the regions overlapping 'allExons' you could call
>     range(allExons) and use that as the 'which' in the param. This would
>     focus the area of interest and speed up the reading.
>
>     Valerie
>
>
>
>
>     On 06/30/14 12:22, Fong Chun Chan wrote:
>
>         Hi,
>
>         I've been using the GenomicFeatures R package to read in some
>         RNA-seq
>         paired-end read data. While the readGAlignments() function reads
>         in the bam
>         file within a minute, I've noticed that the
>         readGAlignmentPairs() function
>         is extremely slow.
>
>         This is even after restricting the space to just a single
>         chromosome along
>         with setting the flags isDuplicate = FALSE and isNotPrimaryRead
>         = FALSE
>         (the latter being suggested in the reference):
>
>         "An easy way to avoid this degradation is to load only primary
>         alignments
>         by setting the isNotPrimaryRead ï¬‚ag to FALSE in ScanBamParam()"
>
>         Based on my understanding, it seems that the ï¬
>         ndMateAlignment() function
>
>         has to be run first. Is there any other way to speed it up?
>         Perhaps I am
>         doing something wrong here?
>
>         Code and sessionInfo() below.
>
>         Thanks!
>
>         ----
>         library("GenomicFeatures")
>         library('GenomicAlignments')
>         library('Rsamtools')
>         si <- seqinfo(BamFile(arguments$__bamFile))
>         gr <- GRanges(arguments$chr, IRanges(100,
>         seqlengths(si)[[arguments$chr]__]-100))
>         allExons <- exons(txdb, columns = c('gene_id', 'exon_id',
>         'exon_name'))
>         scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead =
>         FALSE ) #
>         remove duplicate reads, remove non-primary reads (i.e.
>         multi-aligned reads)
>         reads <- readGAlignmentPairs( arguments$bamFile, param =
>         ScanBamParam(
>         which = gr, flag = scf ) ); # grab reads in specific region
>
>             sessionInfo()
>
>         R version 3.1.0 (2014-04-10)
>         Platform: x86_64-unknown-linux-gnu (64-bit)
>
>         locale:
>            [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>         LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>         LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>           LC_PAPER=en_US.UTF-8
>                 LC_NAME=C                  LC_ADDRESS=C
>         LC_TELEPHONE=C
>                       LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>         attached base packages:
>         [1] parallel  stats     graphics  grDevices utils     datasets
>           methods
>         base
>
>         other attached packages:
>            [1] argparse_1.0.1          proto_0.3-10
>            GenomicAlignments_1.0.1 BSgenome_1.32.0         Rsamtools_1.16.1
>            Biostrings_2.32.0       XVector_0.4.0
>         GenomicFeatures_1.16.2
>            AnnotationDbi_1.26.0    Biobase_2.24.0
>           GenomicRanges_1.16.3
>            GenomeInfoDb_1.0.2      IRanges_1.22.9
>           BiocGenerics_0.10.0
>         vimcom.plus_0.9-93
>         [16] setwidth_1.0-3          colorout_1.0-2
>
>         loaded via a namespace (and not attached):
>            [1] BatchJobs_1.2      BBmisc_1.7         BiocParallel_0.6.1
>         biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6
>         checkmate_1.0
>            codetools_0.2-8    DBI_0.2-7          digest_0.6.4       fail_1.2
>             findpython_1.0.1   foreach_1.4.2      getopt_1.20.0
>           iterators_1.0.7
>              plyr_1.8.1         Rcpp_0.11.2        RCurl_1.95-4.1
>         [19] rjson_0.2.14       RSQLite_0.11.4     rtracklayer_1.24.2
>         sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2      tools_3.1.0
>            XML_3.98-1.1       zlibbioc_1.10.0
>
>
>
>                  [[alternative HTML version deleted]]
>
>
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Valerie Obenchain
>     Program in Computational Biology
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, Seattle, WA 98109
>
>     Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>     Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>
>