[R] Maximum number of patterns and speed in grep

Gabor Grothendieck ggrothendieck at gmail.com
Sun Jul 15 14:27:31 CEST 2012


On Fri, Jul 13, 2012 at 1:41 PM, mdvaan <mathijsdevaan at gmail.com> wrote:
> Here's some data (which should give you the error messages):
>
>     # read in data
>     data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T, sep = ",")
>
>     # first paste all data
>     data1 <- paste(data[,1], collapse = "|")
>
>     # second paste subsets of the data
>     data2a <- paste(data[1:750,1], collapse = "|")
>     data2b <- paste(data[751:1500,1], collapse = "|")
>
>     # define the object to be searched
>     text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")
>
>     # match
>     strapplyc(text, data1)
>     strapplyc(text, data2a)
>     strapplyc(text, data2b)
>
> Thanks in advance!
>

Although strapplyc seems to handle larger regular expressions than
grep in R, neither seems able to handle as many alternatives as in
your example, so process the patterns in chunks:

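library(gsubfn) # provides strapply and strapplyc

k <- 3000 # chunk size

# match text against the chunk of patterns starting at row `from`
f <- function(from, text) {
	to <- min(from + k - 1, nrow(data))
	r <- paste(data[seq(from, to), 1], collapse = "|")
	r <- gsub("[().*?+{}]", "", r) # drop regex metacharacters from the names
	strapply(text, r)
}
ix <- seq(1, nrow(data), k) # first row of each chunk
out <- lapply(text, function(text) unlist(lapply(ix, f, text)))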
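
The same chunking idea can be tried without gsubfn. The following is only a
rough base-R sketch, not a replacement for the code above: demo_names,
demo_text, k_demo, escape_regex and match_chunk are made-up names for
illustration, and the four company names are toy data rather than the
contents of data.csv. It also escapes the regex metacharacters instead of
deleting them, so a name such as "Acme Inc." is matched literally.

```r
# Hypothetical stand-in for the first column of the poster's data
demo_names <- c("Santa Fe Gold Corp", "Starpharma Holdings",
                "Acme Inc.", "Foo (Bar) Ltd")
demo_text <- c("the first is Santa Fe Gold Corp",
               "the second is Starpharma Holdings")

k_demo <- 2 # chunk size, kept tiny here so that more than one chunk is used

# Put a backslash in front of each regex metacharacter
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Match one string against the chunk of names starting at index `from`
match_chunk <- function(from, txt) {
  to <- min(from + k_demo - 1, length(demo_names))
  pat <- paste(escape_regex(demo_names[from:to]), collapse = "|")
  regmatches(txt, gregexpr(pat, txt))[[1]]
}

ix_demo <- seq(1, length(demo_names), by = k_demo) # first index of each chunk
demo_out <- lapply(demo_text,
                   function(txt) unlist(lapply(ix_demo, match_chunk, txt)))
```

demo_out is a list with one character vector of matched names per element of
demo_text. Whether escaping or deleting the metacharacters is preferable
depends on how the names are actually written in the text being searched.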


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


