[R] Maximum number of patterns and speed in grep

mdvaan mathijsdevaan at gmail.com
Mon Jul 23 17:17:39 CEST 2012


Hi,

I have a minor follow-up question:

In the example below, the patterns "ann" and "nn" match inside the word
"annual" in the third element of text. I would like to ignore every match
that is immediately followed by an alphabetic character ([[:alpha:]]). How
can I do this while keeping the "ignore.case = TRUE" argument of the
strapply function?

So the output should be:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"

[[3]]
NULL

Rather than:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"

[[3]]
[1] "ann" "nn"

Thanks!


require(gsubfn)

# read in data (first column holds the patterns to search for)
data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv",
                 header = TRUE, sep = ",")

# define the object to be searched
text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "the annual earnings exceed those of last year")

k <- 3000 # chunk size 

f <- function(from, text) { 
  to <- min(from + k - 1, nrow(data)) 
  r <- paste(data[seq(from, to), 1], collapse = "|") 
  r <- gsub("[().*?+{}]", "", r) 
  strapply(text, r, ignore.case = TRUE) 
} 
ix <- seq(1, nrow(data), k) 
out <- lapply(text, function(text) unlist(lapply(ix, f, text))) 
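
One possible direction for the question above is a negative lookahead, which rejects any match immediately followed by a letter. The following is only a minimal sketch in base R, not a tested fix for the full data: names_vec is a hypothetical stand-in for the first column of data.csv, and gregexpr/regmatches with perl = TRUE are used instead of strapply, since it has not been checked here whether strapply accepts the same pattern. Unlike the output shown above, an element with no match comes back as character(0) rather than NULL.

```r
# Hypothetical stand-ins for the patterns in the first column of data.csv
names_vec <- c("Santa Fe Gold Corp", "Starpharma Holdings", "ann", "nn")

text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "the annual earnings exceed those of last year")

# Join the patterns into one alternation and strip the same
# metacharacters as in f() above
r <- paste(names_vec, collapse = "|")
r <- gsub("[().*?+{}]", "", r)

# Wrap the alternation in a non-capturing group and append a negative
# lookahead: the match must not be followed by an alphabetic character
r <- paste0("(?:", r, ")(?![[:alpha:]])")

# perl = TRUE enables lookahead syntax; ignore.case is kept as before
m <- gregexpr(r, text, ignore.case = TRUE, perl = TRUE)
out <- regmatches(text, m)
```

With this pattern, "ann" and "nn" would still match when they are followed by a space, punctuation, or the end of the string; only matches running into a further letter are dropped.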





