[R] Faster text search in document database than with grep?

Duncan Murdoch murdoch.duncan at gmail.com
Mon Aug 3 15:13:37 CEST 2015


On 03/08/2015 5:25 AM, Witold E Wolski wrote:
> I have a database of text documents (letter sequences): several thousand
> documents with approx. 1000-2000 letters each.
> 
> I need to find exact matches of short sequences of 3-15 letters in those
> documents.
> 
> Without any regexp patterns, the search for one 3-15 letter "word" takes on
> the order of 1 s.
> 
> So for a database with several thousand documents it's on the order of
> hours.
> The naive approach would be to use mcmapply, but then on standard
> hardware I am still in the same order of magnitude, and since R is an
> interactive programming environment this isn't a solution I would go for.
> 
> But aren't there faster algorithmic solutions? Can anyone please point me
> to an implementation available in R?

You haven't shown us what you did, but it sounds far slower than I'd
expect.  I just used the code below to set up a database of 10000
documents of 2000 letters each, and searching those documents for "abc"
takes about 70 milliseconds:

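# Build 10000 random documents of 2000 lowercase letters each
database <- replicate(10000, paste(sample(letters, 2000, replace=TRUE),
                                   collapse=""))

# Indices of the documents that contain the literal string "abc"
grep("abc", database, fixed=TRUE)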
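The same call extends to many search strings by looping over them. The following is only a rough sketch under stated assumptions: the `patterns` vector, the seed, and the smaller database size are made up for illustration and are not from the original post. Wrapping the `lapply()` call in `system.time()` would show how long it takes on a given machine.

```r
# Hypothetical setup: a smaller database so the example runs quickly
set.seed(1)
database <- replicate(1000, paste(sample(letters, 2000, replace = TRUE),
                                  collapse = ""))

# Hypothetical query strings, 3 to 15 letters long, cut from the first 20
# documents so that each pattern is guaranteed to match at least once
patterns <- vapply(1:20, function(i) {
  len   <- sample(3:15, 1)
  start <- sample(2000 - len + 1, 1)
  substr(database[i], start, start + len - 1)
}, character(1))

# One fixed-string grep per pattern; each list element holds the indices
# of the documents that contain that pattern
hits <- lapply(patterns, function(p) grep(p, database, fixed = TRUE))
names(hits) <- patterns

# Number of matching documents per pattern
n_hits <- vapply(hits, length, integer(1))
print(n_hits)
```

With `fixed = TRUE` each pattern is treated as a literal string rather than a regular expression, which fits the exact-match requirement in the question.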

Duncan Murdoch


