[R] Maximum number of patterns and speed in grep

mdvaan mathijsdevaan at gmail.com
Fri Jul 13 15:40:54 CEST 2012


Thanks, I see that it is working in the sample data. My data, however, gives
me an error message: 

data <- strapplyc(text, batch[[l]])
Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") :
  [tcl] couldn't compile regular expression pattern: parentheses () not balanced.

batch[[l]] is similar to your "re" string except that there is a larger
variety of characters. I haven't been able to figure out which characters
are causing trouble here. Any thoughts?
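
One way to narrow this down is to try compiling each pattern on its own and collect the ones that fail. The following is only a sketch: `patterns` is a hypothetical stand-in for the real vector of alternatives, and the check uses base R's regex engine rather than the Tcl engine behind strapplyc, so the two may not reject exactly the same patterns.

```r
# Hypothetical stand-in for the real pattern vector (an assumption, not the poster's data)
patterns <- c("abc", "a(b", "x.y", "d[e")

# TRUE if base R can compile the pattern, FALSE if compilation raises an error.
# Note: this uses base R's engine, not Tcl's, so it is only an approximation.
compiles <- function(p) {
  tryCatch({
    suppressWarnings(grepl(p, ""))
    TRUE
  }, error = function(e) FALSE)
}

ok  <- vapply(patterns, compiles, logical(1), USE.NAMES = FALSE)
bad <- patterns[!ok]

# Patterns that failed to compile, e.g. ones with an unmatched "(" or "["
print(bad)
```

Inspecting `bad` should show which metacharacters (such as an unmatched parenthesis) need to be escaped or removed before the alternatives are pasted together with "|".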

Thank you very much.

Math 
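
If the patterns are meant to be matched literally rather than as regular expressions (an assumption, since the thread does not say), a possible workaround is to skip regex compilation altogether with `fixed = TRUE`. The sketch below uses made-up `x` and `patterns` values and flags each element of `x` that contains at least one pattern. It loops over the patterns, so it is not necessarily faster than a single alternation.

```r
# Made-up test data; the second pattern contains an unbalanced parenthesis
x        <- c("abcd", "z", "d(ef")
patterns <- c("bc", "d(e", "q")

# hit[j] becomes TRUE as soon as any pattern occurs literally in x[j]
hit <- logical(length(x))
for (p in patterns) {
  # fixed = TRUE treats p as a plain string, so "(" needs no escaping
  hit <- hit | grepl(p, x, fixed = TRUE)
}

print(which(hit))
```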




Gabor Grothendieck wrote
> 
> On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdevaan@> wrote:
>> Hi,
>>
>> I am using R's grep function to find patterns in vectors of strings. The
>> number of patterns I would like to match is 7,700 (of different sizes). I
>> noticed that I get an error message when I do the following:
>>
>> data <- integer(length(x))
>> for (j in seq_along(x))
>> {
>> # 1 if x[j] matches any of the patterns, 0 otherwise
>> data[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j],
>> value = TRUE))
>> }
>>
>> When I break this up into 4 chunks of patterns it works:
>>
>> data <- list()
>> for (j in seq_along(x))
>> {
>> # one result vector per chunk of patterns
>> data$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
>> x[j], value = TRUE))
>> data$chunk2[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
>> x[j], value = TRUE))
>> data$chunk3[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
>> x[j], value = TRUE))
>> data$chunk4[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
>> x[j], value = TRUE))
>> }
>>
>> My questions: what's the maximum size of the patterns argument in grep? Is
>> there a way to do this faster? It is very slow.
> 
> Try strapplyc in gsubfn and see
>   http://gsubfn.googlecode.com
> for more info.
> 
> # test data
> x <- c("abcd", "z", "dbef")
> 
> # re is regexp with 7700 alternatives
> #  to test with
> g <- expand.grid(letters, letters, letters)
> gp <- do.call("paste0", g)
> gp7700 <- head(gp, 7700)
> re <- paste(gp7700, collapse = "|")
> 
> # grep gives error message
> grep.out <- grep(re, x)
> 
> # strapplyc works
> library(gsubfn)
> which(sapply(strapplyc(x, re), length) > 0)
> 
> 
> -- 
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> 

--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4636437.html
Sent from the R help mailing list archive at Nabble.com.