[BioC] matchPattern vmatchPattern vectorised

Henderson, Stephen s.henderson at ucl.ac.uk
Sat Dec 14 15:13:37 CET 2013


Sorry

The second I sent my initial post I realised I should have said that the vmatchPattern function is in the Biostrings package, not BSgenome (which is the subject).

Fortunately for me you seem to maintain both. So apologies for any confusion.

Stephen


________________________________________
From: Stephen [guest] <guest at bioconductor.org>
Sent: 14 December 2013 14:00
To: bioconductor at r-project.org; Henderson, Stephen
Cc: BSgenome Maintainer
Subject: matchPattern vmatchPattern vectorised

Hi
I am trying to write a package that will make a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in amongst a fastq of multiplexed reads (e.g. 10-20).

Next I thought I would save them the trouble of copy-pasting Primers, Chromosome, and Start into a shell script by autogenerating the script instead. We have the excellent BSgenome and BSgenome.Mmusculus.UCSC.mm9 packages installed, so this seems a good starting point.

So for the first primer this works well:
> system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus))
   user  system elapsed
 45.853   2.702  50.273

This is fine for a single primer, but it seems from the docs (and testing) that if I want to look up 15 primers it will take 15 passes through the genome and 15x as long, which is about the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still...

My first question: Is there another function or package on BioC that I have missed that might help me with this? Or are there low-level functions I should look at to build a vectorised (exact match) search through Mmusculus?

My second is, I guess, a feature suggestion: Why not allow matchPattern to pass once through the genome, comparing a set of patterns (char vector, DNAStringSet etc.) to the subject? This seems to require little extra computational load (I think).

And given the difficulty of using BLAST within R, this might be a very useful extension.
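
To make the idea concrete, here is a rough sketch of a one-pass search for several primers, using PDict() and matchPDict() from Biostrings. It assumes all primers have the same length (PDict() wants constant-width patterns by default), the second primer and the toy subject are made up for illustration, and match_primers_genome() is a hypothetical helper that is not part of any package. I have not timed it against a full genome.

```r
library(Biostrings)

# Primers to look up; p2 is invented here (the reverse complement of p1).
# Assumption: all primers have the same width, as PDict() expects by default.
primers <- DNAStringSet(c(p1 = "CCAGCACTGTATAGCCGATC",
                          p2 = "GATCGGCTATACAGTGCTGG"))

# Toy subject standing in for a single chromosome
subject <- DNAString(paste0("AAA", as.character(primers[[1]]),
                            "TTT", as.character(primers[[2]]), "AAA"))

# Preprocess the primer set once, then search the subject for all primers
pdict  <- PDict(primers)
hits   <- matchPDict(pdict, subject)   # exact matches, given strand only
starts <- startIndex(hits)             # list with one integer vector per primer

# Hypothetical helper: apply the same search chromosome by chromosome to a
# BSgenome object such as Mmusculus (needs the BSgenome package loaded).
match_primers_genome <- function(genome, pdict) {
  chrs <- seqnames(genome)
  res  <- lapply(chrs, function(chr) matchPDict(pdict, genome[[chr]]))
  setNames(res, chrs)
}
```

With the genome loaded, the call would be something like match_primers_genome(Mmusculus, pdict). Note this only searches the strand as given, so reverse-strand primers would need their reverse complements added to the set.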

thx
Stephen


 -- output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BSgenome_1.30.0                    Biostrings_2.30.1
 [4] GenomicRanges_1.14.4               XVector_0.2.0                      IRanges_1.20.6
 [7] BiocGenerics_0.8.0                 data.table_1.8.10                  dplyr_0.1
[10] hflights_0.1                       Rcpp_0.10.6

loaded via a namespace (and not attached):
 [1] assertthat_0.1 devtools_1.4.1 digest_0.6.4   evaluate_0.5.1 formatR_0.10   httr_0.2       knitr_1.5
 [8] memoise_0.1    RCurl_1.95-4.1 stats4_3.0.2   stringr_0.6.2  tools_3.0.2    whisker_0.3-2

--
Sent via the guest posting facility at bioconductor.org.


