[BioC] Thread safety of BSgenome getSeq()

Johnston, Jeffrey jjj at stowers.org
Tue Jul 1 19:40:43 CEST 2014


Thanks for looking into this Hervé and Martin, and for the quick fix. It is working as expected now. I did have to install the updated packages from source as the binaries do not seem to be updating for Mac or Linux. 

Thanks,
Jeff

On Jun 18, 2014, at 1:14 PM, Hervé Pagès <hpages at fhcrc.org> wrote:

> Hi Jeff,
> 
> Sorry for the delay in answering.
> 
> Thanks for reporting this. Martin Morgan was able to reproduce and
> found out that the problem is originating in the razf C code from
> Samtools (which is included in the Rsamtools package). The razf code
> provides the low level functionality for writing/reading RAZipp'ed
> compressed FASTA files, which is the format used in BioC 2.14 for
> storing the genomic sequences contained in the BSgenome data packages.
> 
> Other issues have been reported with these RAZip based BSgenome
> packages, and, more importantly, it seems that the original Samtools
> developers have moved to other projects and are not going to maintain
> the razf code anymore.
> 
> So this new issue you found convinced me that we should stop using
> the RAZip format in the BSgenome data packages. I'm currently in the
> process of rolling them back to the old on-disk format (i.e. 1
> serialized DNAString object per chromosome). Expect them to show up in
> the BioC package repos in the next couple of days. They'll be at
> version 1.3.1000 (source packages will show up today, binary packages
> will follow).
> 
> Also note that in devel (BioC 3.0), the BSgenome data packages use the
> 2bit format (from UCSC), which is superior in any aspect to the
> previous on-disk formats (i.e. in terms of speed, memory usage, and
> size on disk). Unfortunately I cannot use this format for the BSgenome
> data packages in release because that relies on functionalities in
> the BSgenome *software* package that are only available in BioC devel.
> 
> Let me know if you have any questions or concerns about this.
> 
> Cheers,
> H.
> 
> On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote:
>> Hi,
>> 
>> I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU:
>> 
>> #------------------
>> library(GenomicRanges)
>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+")
>> gr.list <- lapply(1:6, function(i) gr )
>> 
>> seqs.list <- mclapply(gr.list, function(gr) {
>>   message("getSeq() started")
>>   s <- getSeq(Dmelanogaster, gr)  # does not reliably return if mc.cores > 1
>>   message("getSeq() returned")
>>   s
>> }, mc.cores=2)
>> #------------------
>> 
>> If I instead load the BSgenome package inside the parallelized function everything is fine:
>> 
>> #------------------
>> library(GenomicRanges)
>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+")
>> detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE)
>> gr.list <- lapply(1:6, function(i) gr )
>> 
>> seqs.list <- mclapply(gr.list, function(gr) {
>>   library(BSgenome.Dmelanogaster.UCSC.dm3)
>>   message("getSeq() started")
>>   s <- getSeq(Dmelanogaster, gr)  # always works
>>   message("getSeq() returned")
>>   s
>> }, mc.cores=2)
>> #------------------
>> 
>> I can reproduce this issue on both Mac and Linux (both 64-bit).
>> 
>> Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()?
>> 
>> Thanks,
>> Jeff Johnston
>> Zeitlinger Lab
>> Stowers Institute for Medical Research
>> 
>> #------------------
>>> sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0                        Biostrings_2.32.0                      XVector_0.4.0
>> [5] GenomicRanges_1.16.3                   GenomeInfoDb_1.0.2                     IRanges_1.22.8                         BiocGenerics_0.10.0
>> [9] setwidth_1.0-3
>> 
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6     Rsamtools_1.16.0 stats4_3.1.0     zlibbioc_1.10.0
>> #------------------
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioconductor mailing list