[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()

Thu Jul 3 18:40:24 CEST 2014

Correction below.

On 07/03/2014 09:28 AM, Valerie Obenchain wrote:
> Hi,
>
> On 07/02/2014 12:56 PM, Fong Chun Chan wrote:
>> Hi Valerie,
>>
>> Thanks for the reply. I'll post some runtime as soon as it finishes
>> running the readGAlignments and readGAlignmentPairs.
>>
>> Thanks for the tip on passing range(allExons) into the which param.
>> There is only one issue in that if I wanted to then generate say RPKM
>> values where I need to know the "total number of aligned reads" in a bam
>> file. Because I've read in just a subset of the bam (i.e. the reads that
>> aligned just to exons) I would have technically produced an incorrection
>> calculation of RPKM values (assuming you normalize using the number of
>> aligned reads). On that note, is there a quick and easy way to just
>> count the number of aligned reads or aligned paired reads so I can
>> quickly normalize by this way. This way I can still use the which
>> parameter in readGAlignments and readGAlignmentPairs to speed it up?
>
> For single-end you can use countBam():
>
>  > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
>  > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE))
>  > countBam(fl, param = param)
>    space start end width    file records nucleotides
> 1    NA    NA  NA    NA ex1.bam    3271      115286
>
> It's not so clear for paired-end. The mate-pairing algorithm used in the
> readGAlignment* functions is described on the man page and uses a
> combination of BAM fields and flags. You can't really get at this simply
> by creating a param with a flag combination such as
>
> isPaired = TRUE
> isProperPair = TRUE
> isUnmappedQuery = FALSE
> hasUnmappedMate = FALSE
>
> For example, if you're interested in this region
>
> gr <- GRanges("seq1", IRanges(100, 500))
>
> and set flags,
>
> flags <- scanBamFlag(isPaired = TRUE,
>                       isProperPair = TRUE,
>                       isUnmappedQuery = FALSE,
>                       hasUnmappedMate = FALSE)
>
> countBam() reports 335 records. These are individual records so the mate
> pairs (parts) are each counted separately. Only records that fall within
> the region defined by 'gr' are counted.
>
>  > param <- ScanBamParam(which = gr, flag = flags)
>  > countBam(fl, param = param)
>    space start end width    file records nucleotides
> 1  seq1   100 500   401 ex1.bam     335       11785
>
>
> readGAlignmentPairs() reports 223 pairs (446 individual records). All
> records in 'gr' are mated if possible. In the case where one mate part
> falls inside 'gr' and one outside, the other mate will be retrieved
> (i.e., outside the range of 'gr'). This makes the 335 above confusing
> for counting the number of aligned pairs in the region. The number is a
> mixture of (1) mate parts both in 'gr' that belong together (2) mate
> parts with mate outside 'gr' and (3) mates or mate parts that don't meet
> other mate-pairing criteria.
>
>  > galp <- readGAlignmentPairs(fl, param=param)
>  > length(galp)
> [1] 223
>
> Off hand I can't think of a good way to count aligned pairs w/in a
> specified region.

What I should have said here is I can't think of a good way to 
accurately count aligned pairs (1 count per pair) in a BAM file w/o 
using the pairing algo.

Valerie

>
>>
>> Also another thing I notice is that when I use the
>> readGAlignmentPairs function, I lose my ability to cancel the command
>> (ctrl-c no longer works).
>
> It's implemented in C/C++. You can try ctrl+\ in the session or kill
> with the top command from another terminal. Maybe others have
> suggestions they'll add ...
>
> Valerie
>
> Not sure this is a bug, or if it is something
>> to do with the internal workings of the function. The loss of ctrl-c
>> also occurs when trying to run as a Rscript. I can seemingly ctrl-c
>> anytime before, but it hits this step it just refuses to cancel.
>>
>> Thanks,
>>
>>
>>
>>
>>
>>
>> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> wrote:
>>
>>     Hi,
>>
>>     You don't need to call findMateAlignment() when using
>>     readGAlignmentPairs(). In a previous iteration,
>>     readGAlignmentPairs() called findMateAlignment() but this is no
>>     longer the case.
>>
>>     readGAlignmentPairs() does have more overhead than readGAlignments()
>>     because of the mate paring. These numbers are for a BAM file with
>>     800484 records (400054 final pairs) on my not-so-speedy laptop.
>>
>>     library(microbenchmark)
>>     library(RNAseqData.HNRNPC.bam.__chr14)
>>     fls <- RNAseqData.HNRNPC.bam.chr14___BAMFILES
>>
>>               microbenchmark(__readGAlignments(fls[1]), times=5)
>>
>>         Unit: seconds
>>                              expr      min       lq  median       uq
>>           max neval
>>           readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071
>>         2.121211     5
>>
>>
>>
>>     ## mate pairing
>>
>>               microbenchmark(__readGAlignmentPairs(fls[1]), times=5)
>>
>>         Unit: seconds
>>                                  expr      min       lq   median
>>         uq      max neval
>>           readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194
>>         9.843282 11.37441     5
>>
>>
>>     ## mate pairing
>>
>>               microbenchmark(__readGAlignmentsList(fls[1]), times=5)
>>
>>         Unit: seconds
>>                                  expr      min       lq   median
>>         uq      max neval
>>           readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696
>>         9.692845 9.867628     5
>>
>>
>>
>>     You can set a 'yieldSize' when creating the BamFile. This will chunk
>>     through the file 'yieldSize' records at a time and is useful when
>>     processing a complete file and when there are memory limitations.
>>
>>        bf <- BamFile(myfile, yieldSize = 1000000)
>>
>>     Herve reported some additional timings for readGAlignmentPairs() on
>>     this post:
>>
>>     https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html
>>     <https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html>
>>
>>
>>     Can you provide some numbers for how long your run is taking and the
>>     number of records you're trying to read in?
>>
>>     I assume 'allExons' is used for counting against the BAM files in
>>     later code? Maybe this is why your were looking into
>>     summarizeOverlaps(). I should point out that if you're only
>>     interested in the regions overlapping 'allExons' you could call
>>     range(allExons) and use that as the 'which' in the param. This would
>>     focus the area of interest and speed up the reading.
>>
>>     Valerie
>>
>>
>>
>>
>>     On 06/30/14 12:22, Fong Chun Chan wrote:
>>
>>         Hi,
>>
>>         I've been using the GenomicFeatures R package to read in some
>>         RNA-seq
>>         paired-end read data. While the readGAlignments() function reads
>>         in the bam
>>         file within a minute, I've noticed that the
>>         readGAlignmentPairs() function
>>         is extremely slow.
>>
>>         This is even after restricting the space to just a single
>>         chromosome along
>>         with setting the flags isDuplicate = FALSE and isNotPrimaryRead
>>         = FALSE
>>         (the latter being suggested in the reference):
>>
>>         "An easy way to avoid this degradation is to load only primary
>>         alignments
>>         by setting the isNotPrimaryRead ï¬‚ag to FALSE in ScanBamParam()"
>>
>>         Based on my understanding, it seems that the ï¬
>>         ndMateAlignment() function
>>
>>         has to be run first. Is there any other way to speed it up?
>>         Perhaps I am
>>         doing something wrong here?
>>
>>         Code and sessionInfo() below.
>>
>>         Thanks!
>>
>>         ----
>>         library("GenomicFeatures")
>>         library('GenomicAlignments')
>>         library('Rsamtools')
>>         si <- seqinfo(BamFile(arguments$__bamFile))
>>         gr <- GRanges(arguments$chr, IRanges(100,
>>         seqlengths(si)[[arguments$chr]__]-100))
>>         allExons <- exons(txdb, columns = c('gene_id', 'exon_id',
>>         'exon_name'))
>>         scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead =
>>         FALSE ) #
>>         remove duplicate reads, remove non-primary reads (i.e.
>>         multi-aligned reads)
>>         reads <- readGAlignmentPairs( arguments$bamFile, param =
>>         ScanBamParam(
>>         which = gr, flag = scf ) ); # grab reads in specific region
>>
>>             sessionInfo()
>>
>>         R version 3.1.0 (2014-04-10)
>>         Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>         locale:
>>            [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>         LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>         LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>           LC_PAPER=en_US.UTF-8
>>                 LC_NAME=C                  LC_ADDRESS=C
>>         LC_TELEPHONE=C
>>                       LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>         attached base packages:
>>         [1] parallel  stats     graphics  grDevices utils     datasets
>>           methods
>>         base
>>
>>         other attached packages:
>>            [1] argparse_1.0.1          proto_0.3-10
>>            GenomicAlignments_1.0.1 BSgenome_1.32.0
>> Rsamtools_1.16.1
>>            Biostrings_2.32.0       XVector_0.4.0
>>         GenomicFeatures_1.16.2
>>            AnnotationDbi_1.26.0    Biobase_2.24.0
>>           GenomicRanges_1.16.3
>>            GenomeInfoDb_1.0.2      IRanges_1.22.9
>>           BiocGenerics_0.10.0
>>         vimcom.plus_0.9-93
>>         [16] setwidth_1.0-3          colorout_1.0-2
>>
>>         loaded via a namespace (and not attached):
>>            [1] BatchJobs_1.2      BBmisc_1.7         BiocParallel_0.6.1
>>         biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6
>>         checkmate_1.0
>>            codetools_0.2-8    DBI_0.2-7          digest_0.6.4
>> fail_1.2
>>             findpython_1.0.1   foreach_1.4.2      getopt_1.20.0
>>           iterators_1.0.7
>>              plyr_1.8.1         Rcpp_0.11.2        RCurl_1.95-4.1
>>         [19] rjson_0.2.14       RSQLite_0.11.4     rtracklayer_1.24.2
>>         sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2
>> tools_3.1.0
>>            XML_3.98-1.1       zlibbioc_1.10.0
>>
>>
>>
>>                  [[alternative HTML version deleted]]
>>
>>
>>
>>         _________________________________________________
>>         Bioconductor mailing list
>>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>         Search the archives:
>>
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>>     --
>>     Valerie Obenchain
>>     Program in Computational Biology
>>     Fred Hutchinson Cancer Research Center
>>     1100 Fairview Ave. N, Seattle, WA 98109
>>
>>     Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>     Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor