[R] Maximum number of patterns and speed in grep

Gabor Grothendieck ggrothendieck at gmail.com
Fri Jul 6 18:27:39 CEST 2012


On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdevaan at gmail.com> wrote:
> Hi,
>
> I am using R's grep function to find patterns in vectors of strings. The
> number of patterns I would like to match is 7,700 (of different sizes). I
> noticed that I get an error message when I do the following:
>
> data <- integer(length(x))
> for (j in seq_along(x))
> {
> data[j] <- length(grep(paste(patterns[1:7700], collapse = "|"),  x[j],
> value = TRUE))
> }
>
> When I break this up into 4 chunks of patterns it works:
>
> data <- list()
> for (j in seq_along(x))
> {
> data$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
> x[j], value = TRUE))
> data$chunk2[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
> x[j], value = TRUE))
> data$chunk3[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
> x[j], value = TRUE))
> data$chunk4[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
> x[j], value = TRUE))
> }
>
> My questions: what's the maximum size of the patterns argument in grep? Is
> there a way to do this faster? It is very slow.

Try strapplyc in the gsubfn package, which works with a regexp of this
size (example below), and see
  http://gsubfn.googlecode.com
for more info.

# test data
x <- c("abcd", "z", "dbef")

# re is a regexp with 7700 alternatives to test with
g <- expand.grid(letters, letters, letters)
gp <- do.call("paste0", g)
gp7700 <- head(gp, 7700)
re <- paste(gp7700, collapse = "|")

# grep gives error message
grep.out <- grep(re, x)

# strapplyc works; which() gives the indices of the elements of x
# that match at least one alternative
library(gsubfn)
which(sapply(strapplyc(x, re), length) > 0)
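
If you would rather stay in base R, the chunking idea from the original
post can be wrapped in a small function and vectorized over x, so there
is no loop over the strings. This is only a rough sketch: grepl_chunked
and its chunk_size argument are hypothetical names, not part of base R or
gsubfn, and the default of 2500 is simply the chunk size reported to work
in the question, not a documented limit.

```r
# Hypothetical helper: TRUE for each element of x that matches at least
# one of the patterns. Patterns are tested chunk_size at a time so that
# no single regular expression becomes too large.
grepl_chunked <- function(patterns, x, chunk_size = 2500) {
  # chunk number for each pattern: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(patterns) / chunk_size)
  hit <- logical(length(x))
  for (chunk in split(patterns, chunk_id)) {
    todo <- !hit                  # elements that have not matched yet
    if (!any(todo)) break         # everything matched already, stop early
    re_chunk <- paste(chunk, collapse = "|")
    hit[todo] <- grepl(re_chunk, x[todo])
  }
  hit
}

# same test data and patterns as above
x <- c("abcd", "z", "dbef")
g <- expand.grid(letters, letters, letters)
gp7700 <- head(do.call("paste0", g), 7700)

# indices of the elements of x that match at least one pattern
which(grepl_chunked(gp7700, x))
```

For the test data this should flag elements 1 and 3. Each chunk is only
run against the strings that have not matched yet, which may help when
most strings match early; whether it is faster than strapplyc on real
data would need to be timed.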


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


