[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()

Valerie Obenchain vobencha at fhcrc.org
Tue Jul 1 19:06:18 CEST 2014


Hi,

You don't need to call findMateAlignment() when using 
readGAlignmentPairs(). In a previous iteration, readGAlignmentPairs() 
called findMateAlignment() but this is no longer the case.

readGAlignmentPairs() does have more overhead than readGAlignments() 
because of the mate paring. These numbers are for a BAM file with 800484 
records (400054 final pairs) on my not-so-speedy laptop.

library(microbenchmark)
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES

>>  microbenchmark(readGAlignments(fls[1]), times=5)
> Unit: seconds
>                     expr      min       lq  median       uq      max neval
>  readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071 2.121211     5
>>

## mate pairing
>>  microbenchmark(readGAlignmentPairs(fls[1]), times=5)
> Unit: seconds
>                         expr      min       lq   median       uq      max neval
>  readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194 9.843282 11.37441     5

## mate pairing
>>  microbenchmark(readGAlignmentsList(fls[1]), times=5)
> Unit: seconds
>                         expr      min       lq   median       uq      max neval
>  readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696 9.692845 9.867628     5


You can set a 'yieldSize' when creating the BamFile. This will chunk 
through the file 'yieldSize' records at a time and is useful when 
processing a complete file and when there are memory limitations.

   bf <- BamFile(myfile, yieldSize = 1000000)

Herve reported some additional timings for readGAlignmentPairs() on this 
post:

   https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html


Can you provide some numbers for how long your run is taking and the 
number of records you're trying to read in?

I assume 'allExons' is used for counting against the BAM files in later 
code? Maybe this is why your were looking into summarizeOverlaps(). I 
should point out that if you're only interested in the regions 
overlapping 'allExons' you could call range(allExons) and use that as 
the 'which' in the param. This would focus the area of interest and 
speed up the reading.

Valerie



On 06/30/14 12:22, Fong Chun Chan wrote:
> Hi,
>
> I've been using the GenomicFeatures R package to read in some RNA-seq
> paired-end read data. While the readGAlignments() function reads in the bam
> file within a minute, I've noticed that the readGAlignmentPairs() function
> is extremely slow.
>
> This is even after restricting the space to just a single chromosome along
> with setting the flags isDuplicate = FALSE and isNotPrimaryRead = FALSE
> (the latter being suggested in the reference):
>
> "An easy way to avoid this degradation is to load only primary alignments
> by setting the isNotPrimaryRead flag to FALSE in ScanBamParam()"
>
> Based on my understanding, it seems that the findMateAlignment() function
> has to be run first. Is there any other way to speed it up? Perhaps I am
> doing something wrong here?
>
> Code and sessionInfo() below.
>
> Thanks!
>
> ----
> library("GenomicFeatures")
> library('GenomicAlignments')
> library('Rsamtools')
> si <- seqinfo(BamFile(arguments$bamFile))
> gr <- GRanges(arguments$chr, IRanges(100,
> seqlengths(si)[[arguments$chr]]-100))
> allExons <- exons(txdb, columns = c('gene_id', 'exon_id', 'exon_name'))
> scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = FALSE ) #
> remove duplicate reads, remove non-primary reads (i.e. multi-aligned reads)
> reads <- readGAlignmentPairs( arguments$bamFile, param = ScanBamParam(
> which = gr, flag = scf ) ); # grab reads in specific region
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8
>        LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
>              LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> base
>
> other attached packages:
>   [1] argparse_1.0.1          proto_0.3-10
>   GenomicAlignments_1.0.1 BSgenome_1.32.0         Rsamtools_1.16.1
>   Biostrings_2.32.0       XVector_0.4.0           GenomicFeatures_1.16.2
>   AnnotationDbi_1.26.0    Biobase_2.24.0          GenomicRanges_1.16.3
>   GenomeInfoDb_1.0.2      IRanges_1.22.9          BiocGenerics_0.10.0
> vimcom.plus_0.9-93
> [16] setwidth_1.0-3          colorout_1.0-2
>
> loaded via a namespace (and not attached):
>   [1] BatchJobs_1.2      BBmisc_1.7         BiocParallel_0.6.1
> biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6         checkmate_1.0
>   codetools_0.2-8    DBI_0.2-7          digest_0.6.4       fail_1.2
>    findpython_1.0.1   foreach_1.4.2      getopt_1.20.0      iterators_1.0.7
>     plyr_1.8.1         Rcpp_0.11.2        RCurl_1.95-4.1
> [19] rjson_0.2.14       RSQLite_0.11.4     rtracklayer_1.24.2
> sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2      tools_3.1.0
>   XML_3.98-1.1       zlibbioc_1.10.0
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioconductor mailing list