[R] Faster text search in document database than with grep?

Witold E Wolski wewolski at gmail.com
Mon Aug 3 15:45:30 CEST 2015


Dear Duncan,

This is a model of the data I work with.

database <- replicate(50000, paste(sample(letters, rexp(1, 1/500), replace=TRUE),
                                   collapse=""))

words <- replicate(10000, paste(sample(letters, rexp(1, 1/70), replace=TRUE),
                                collapse=""))

NumberOfWords <- 10
system.time(lapply(words[1:NumberOfWords], grep, database))
   user  system elapsed
  5.002   0.003   5.005

The model reproduces the running times I have to cope with.

Using grep in this context is rather naive, and I am wondering whether there
are better solutions available in R.
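
As a point of comparison, here is a minimal sketch (not a benchmark) of the
same kind of search with fixed=TRUE, as in Duncan's reply quoted below. The
sizes are much smaller than in my model so that it runs quickly, and the
helper random_string(), the seed, and the mean word length of 5 are
assumptions made only for this illustration. Because the patterns consist of
lowercase letters only, the regex and the literal search should return the
same hits; only the timings may differ.

```r
# Sketch only: sizes are far smaller than in the model above.
set.seed(1)  # assumption: seed added so that the run is repeatable

# Hypothetical helper: a random lowercase string with an exponentially
# distributed length; ceiling() keeps every length at 1 or more.
random_string <- function(mean_len) {
  paste(sample(letters, ceiling(rexp(1, 1 / mean_len)), replace = TRUE),
        collapse = "")
}

database <- replicate(2000, random_string(500))
words    <- replicate(10, random_string(5))

# Default grep() treats the pattern as a regular expression;
# fixed = TRUE treats it as a literal string.
hits_regex <- lapply(words, grep, database)
hits_fixed <- lapply(words, grep, database, fixed = TRUE)

# Letters-only patterns contain no regex metacharacters, so both
# variants should find the same documents.
stopifnot(identical(hits_regex, hits_fixed))

# Compare the timings on your own machine.
system.time(lapply(words, grep, database))
system.time(lapply(words, grep, database, fixed = TRUE))
```

Each element of hits_fixed holds the indices of the documents that contain
the corresponding word.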



On 3 August 2015 at 15:13, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

> On 03/08/2015 5:25 AM, Witold E Wolski wrote:
> > I have a database of text documents (letter sequences). Several thousands
> > of documents with approx. 1000-2000 letters each.
> >
> > I need to find exact matches of short 3-15 letters sequences in those
> > documents.
> >
> > Without any regexp patterns the search of one 3-15 letter "word" takes
> > on the order of 1s.
> >
> > So for a database with several thousand documents it's on the order of
> > hours.
> > The naive approach would be to use mcmapply, but then on standard
> > hardware I am still in the same order of magnitude, and since R is an
> > interactive programming environment this isn't a solution I would go for.
> >
> > But aren't there faster algorithmic solutions? Can anyone please point me
> > to an implementation available in R?
>
> You haven't shown us what you did, but it sounds far slower than I'd
> expect.  I just used the code below to set up a database of 10000
> documents of 2000 letters each, and searching those documents for "abc"
> takes about 70 milliseconds:
>
> database <- replicate(10000, paste(sample(letters, 2000, rep=TRUE),
> collapse=""))
>
> grep("abc", database, fixed=TRUE)
>
> Duncan Murdoch
>



-- 
Witold Eryk Wolski



