[BioC] Thread safety of BSgenome getSeq()

Johnston, Jeffrey jjj at stowers.org
Fri Jun 6 22:15:29 CEST 2014


Hi,

I have run into a problem calling getSeq() on a BSgenome object inside a function parallelized with mclapply(). When getSeq() is called from several forked worker processes at once, at least one of them hangs indefinitely at 100% CPU:

#------------------
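library(parallel)  # provides mclapply()
library(GenomicRanges)
library(BSgenome.Dmelanogaster.UCSC.dm3)
# 10,000 random 20-bp ranges on the + strand of chr2L
starts <- sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000)
gr <- GRanges(seqnames="chr2L", ranges=IRanges(start=starts, width=20), strand="+")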
gr.list <- lapply(1:6, function(i) gr )

seqs.list <- mclapply(gr.list, function(gr) {
  message("getSeq() started")
  s <- getSeq(Dmelanogaster, gr)  # does not reliably return if mc.cores > 1
  message("getSeq() returned")
  s
}, mc.cores=2)
#------------------

If I instead make sure the BSgenome data package is not loaded in the parent session and load it inside the parallelized function, everything works:

#------------------
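library(parallel)  # provides mclapply()
library(GenomicRanges)
library(BSgenome.Dmelanogaster.UCSC.dm3)
# 10,000 random 20-bp ranges on the + strand of chr2L
starts <- sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000)
gr <- GRanges(seqnames="chr2L", ranges=IRanges(start=starts, width=20), strand="+")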
detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE)
gr.list <- lapply(1:6, function(i) gr )

seqs.list <- mclapply(gr.list, function(gr) {
  library(BSgenome.Dmelanogaster.UCSC.dm3)
  message("getSeq() started")
  s <- getSeq(Dmelanogaster, gr)  # always works
  message("getSeq() returned")
  s
}, mc.cores=2)
#------------------

I can reproduce this issue on both Mac and Linux (both 64-bit).

Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()?

Thanks,
Jeff Johnston
Zeitlinger Lab
Stowers Institute for Medical Research

#------------------
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0                        Biostrings_2.32.0                      XVector_0.4.0                         
[5] GenomicRanges_1.16.3                   GenomeInfoDb_1.0.2                     IRanges_1.22.8                         BiocGenerics_0.10.0                   
[9] setwidth_1.0-3                        

loaded via a namespace (and not attached):
[1] bitops_1.0-6     Rsamtools_1.16.0 stats4_3.1.0     zlibbioc_1.10.0 
#------------------


