[BioC] [devteam-bioc] readGAlignmentPairs performance issue

Hervé Pagès hpages at fhcrc.org
Tue May 20 21:27:52 CEST 2014


Hi Phil,

I don't have access to your BAM file but here are the timings I get for
readGAlignmentPairs(). (My file contains 100,000,000 pairs but I use
'which' to load only pairs located on chr1-4 so the result contains only
16,938,029 pairs):

   - with BioC 2.13:
        user  system elapsed
     439.784  30.218 470.136

   - with BioC 2.14:
        user  system elapsed
     319.212  11.492 331.201

So the new code is about 40% faster for me (it also uses about 20% less
memory).
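
A quick check of that figure in R, using only the elapsed times quoted above (the variable names are just for illustration):

```r
# Elapsed times (seconds) reported above for readGAlignmentPairs()
elapsed_213 <- 470.136   # BioC 2.13
elapsed_214 <- 331.201   # BioC 2.14

# Speedup as a ratio of old elapsed time to new elapsed time
speedup <- elapsed_213 / elapsed_214

# Fraction of wall-clock time saved relative to BioC 2.13
time_saved <- 1 - elapsed_214 / elapsed_213

print(round(speedup, 2))
print(round(100 * time_saved))
```

Note that "40% faster" here describes the speedup ratio; the same numbers correspond to a reduction in elapsed time of about 30%.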

The timings you report below with BioC 2.14 for loading 108,592,829
pairs look reasonable to me. What is really surprising is the timing
you get with BioC 2.13: only 208s to load 108,592,829 pairs! This is
15x faster than with BioC 2.14! Can you confirm this? If so, would you
mind making the file accessible to us so we can have a look at it?
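
The ratios behind that comparison can be checked directly from the elapsed times quoted in this thread (a small sketch; the variable names are only illustrative):

```r
# Elapsed times (seconds) for readGAlignmentPairs() reported by Phil in this thread
bioc_213       <- 208        # R-3.0.2, BioC 2.13
bioc_214_first <- 3118       # BioC 2.14, original report (13 May)
bioc_214_rerun <- 2619.769   # BioC 2.14, later timing (20 May)

# How many times longer BioC 2.14 took than BioC 2.13
ratio_first <- bioc_214_first / bioc_213
ratio_rerun <- bioc_214_rerun / bioc_213

print(round(c(first = ratio_first, rerun = ratio_rerun), 1))
```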

Thanks,
H.


On 05/20/2014 06:31 AM, Maintainer wrote:
> Hi Valerie,
>
> Thank you for getting back to me. Here are the times for
> readGAlignmentPairs, readGAlignmentsList, and scanBam using the code you
> sent.
>
> $readGAlignmentsList
>      user   system  elapsed
> 2529.510   57.487 2589.144
>
> $scanBam
>      user   system  elapsed
> 2465.353   49.404 2516.275
>
> $readGAlignmentPairs
>      user   system  elapsed
> 2560.754   56.612 2619.769
>
> Best wishes
> Phil
>
> On Fri, 2014-05-16 at 12:55 -0700, Valerie Obenchain wrote:
>> Hi Phil,
>>
>> We have several functions that call the same C code in the background.
>> To help isolate the problem can you please run your code with scanBam()
>> and readGAlignmentsList()?
>>
>> bf <- BamFile(fl, asMates=TRUE)
>> readGAlignmentsList(bf, param=param0)
>> scanBam(bf, param=param0)
>>
>> readGAlignmentsList() and readGAlignmentPairs() should be very close in
>> time. scanBam() will be faster but not by a huge amount.
>>
>> Thanks.
>> Valerie
>>
>>
>> On 05/13/2014 07:23 AM, Maintainer wrote:
>>> Hi Guys,
>>>
>>> I'm experiencing some performance issues with readGAlignmentPairs from the latest version of Bioconductor (GenomicAlignments_1.0.1, BioC 2.14, R 3.1.0)
>>>
>>> Reading RNASeq paired reads aligned to chr19 (mm9) from a BAM file containing 108,592,829 paired reads takes 3118s. The same code run in R-3.0.2, BioC 2.13, Rsamtools_1.14.3 takes 208s. The results are identical across the two versions.
>>>
>>> Here's the code:
>>>
>>> library(GenomicAlignments)
>>> library(Rsamtools)
>>>
>>> param0 <- ScanBamParam(which=GRanges(seqnames="chr19",
>>>                                      ranges=IRanges(start=1, end=chr19Length)))
>>> rd <- readGAlignmentPairs(bamFile, param=param0)
>>>
>>> Any ideas as to why this might be?
>>>
>>> Thanks in advance
>>>
>>> Phil East
>>>
>>>
>>>
>>>    -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>    [1] LC_CTYPE=en_GB       LC_NUMERIC=C         LC_TIME=en_GB
>>>    [4] LC_COLLATE=en_GB     LC_MONETARY=en_GB    LC_MESSAGES=en_GB
>>>    [7] LC_PAPER=en_GB       LC_NAME=C            LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C       LC_MEASUREMENT=en_GB LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] grDevices datasets  parallel  stats     graphics  utils     methods
>>> [8] base
>>>
>>> other attached packages:
>>>    [1] GenomicAlignments_1.0.1 BSgenome_1.32.0         Rsamtools_1.16.0
>>>    [4] Biostrings_2.32.0       XVector_0.4.0           GenomicRanges_1.16.3
>>>    [7] GenomeInfoDb_1.0.2      IRanges_1.22.6          Biobase_2.24.0
>>> [10] BiocGenerics_0.10.0
>>>
>>> loaded via a namespace (and not attached):
>>>    [1] BatchJobs_1.2      BBmisc_1.6         BiocParallel_0.6.0 bitops_1.0-6
>>>    [5] brew_1.0-6         codetools_0.2-8    DBI_0.2-7          digest_0.6.4
>>>    [9] fail_1.2           foreach_1.4.2      iterators_1.0.7    plyr_1.8.1
>>> [13] Rcpp_0.11.1        RSQLite_0.11.4     sendmailR_1.1-2    stats4_3.1.0
>>> [17] stringr_0.6.2      tools_3.1.0        zlibbioc_1.10.0
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> ________________________________________________________________________
>>> devteam-bioc mailing list
>>> To unsubscribe from this mailing list send a blank email to
>>> devteam-bioc-leave at lists.fhcrc.org
>>> You can also unsubscribe or change your personal options at
>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>
>>
>>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


