[BioC] matchPattern vmatchPattern vectorised

Henderson, Stephen s.henderson at ucl.ac.uk
Sat Dec 14 17:05:42 CET 2013

Thanks, Steve.

I hadn't heard of gmapR (and GSNAP); I will check it out now. The Rsubread package is a possibility, as I already have it installed and planned to use it later in the pipeline, so there is little extra overhead -- in my case.

However, a quick look at the documentation and code of both suggests that they build indexed genomes, which is of course appropriate for multimillion-read aligners.

Yet I still think there is a gap for a simple tool (such as vmatchPattern) that exactly matches a small vector or StringSet of patterns in one pass, rather than just a single pattern. It's a pretty common mol bio lab task (what with all the new multiplexing techs).
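
As a rough sketch of the kind of one-pass, multi-pattern exact search I mean: Biostrings has PDict() and matchPDict(), which preprocess a set of equal-width patterns and match them all against a single subject sequence. The second primer and the toy subject below are made up for illustration, and I have not checked how this behaves or scales on a whole Mmusculus chromosome.

```r
library(Biostrings)

# Primers must all have the same width for this simple use of PDict().
# p1 is the primer from my original post; p2 is a made-up example.
primers <- DNAStringSet(c(
  p1 = "CCAGCACTGTATAGCCGATC",
  p2 = "GATTACAGATTACAGATTAC"
))

# Toy subject standing in for one chromosome (hypothetical sequence).
subject <- DNAString(paste0(
  "ACGT", "CCAGCACTGTATAGCCGATC",
  "TTTT", "GATTACAGATTACAGATTAC",
  "ACGT"
))

# Preprocess the whole pattern set once ...
pdict <- PDict(primers)

# ... then match every pattern against the subject in a single call.
# This searches the given strand only; for the other strand one could
# repeat the search with reverseComplement(primers).
hits   <- matchPDict(pdict, subject)
starts <- startIndex(hits)            # list: one vector of start positions per primer
counts <- countPDict(pdict, subject)  # number of exact hits per primer

print(starts)
print(counts)
```

On a real genome the same call would presumably be repeated once per chromosome (e.g. on Mmusculus[["chr1"]]), so the genome is still read only once for all primers; I have not timed that.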



On Sat, Dec 14, 2013 at 6:00 AM, Stephen [guest] <guest at bioconductor.org> wrote:
> Hi
> I am trying to write a package that will provide a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in a fastq of multiplexed reads (e.g. 10-20).
> Next I thought I would save them the trouble of copy-pasting Primers, Chromosome, and Start into a shell script by autogenerating the script instead. We have the excellent BSgenome and Mmusculus9 packages installed, so this seems a good starting point.
> So for the first primer this works well:
>> system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus))
>    user  system elapsed
>  45.853   2.702  50.273
> This is fine for a single primer, but it seems from the docs (and testing) that if I want to look up 15 primers it will take 15 passes through the genome and 15x as long. That is about the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still...
> My first question: Is there another function or package on BioC that I have missed that might help me with this? Or low level functions I should look at to build a vectorised search (exact match) through Mmusculus?

There are packages which wrap aligners that you might consider using:

* gmapR

* Rsubread


Steve Lianoglou
Computational Biologist
