[R] Regular expressions: retrieving matches depending on intervening strings

Stefan Th. Gries stgries_lists at arcor.de
Wed Aug 16 09:17:36 CEST 2006

Dear all,

I have another regular expression question. I have this character vector a:

a<-c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
     "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
     "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
     "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

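I would like to retrieve those elements of a in which the words tagged "<w CJC>" and "<w DT0>" are either

- directly adjacent, as in a[1], or
- separated only by material that does not match "<[wc] " (i.e. no intervening word or punctuation tag), as in a[2].

So a[3] and a[4] should not be retrieved.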

From each of these elements, I would like to extract all characters from the "<" in "<w CJC>" up to the end of the run of non-"<" characters that follows "<w DT0>". For example, if I were searching only a[1], I would use something like this:

matches<-gregexpr("<w CJC>[^<]+?<w DT0>[^<]+", a[1], perl=TRUE)
substr(a[1], unlist(matches), unlist(matches)+attr(matches[[1]], "match.length")-1)
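
For the whole vector, the kind of pattern described above might be sketched as follows. This is only a rough illustration, assuming a negative lookahead is the right tool: intervening tags are allowed unless they begin with "<w " or "<c ". The names pat, starts, lens, hit and hits are hypothetical, and the pattern has only been checked against the four strings in a, so it may well fail on other corpus lines.

```r
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# Hypothetical pattern (an assumption, not a tested general solution):
#   <w CJC>[^<]+                   the conjunction and its text
#   (?:<(?![wc] )[^>]*>[^<]*)*     any number of tags NOT starting with "<w " or "<c "
#   <w DT0>[^<]+                   the determiner and its text up to the next "<"
pat <- "<w CJC>[^<]+(?:<(?![wc] )[^>]*>[^<]*)*<w DT0>[^<]+"

m      <- regexpr(pat, a, perl = TRUE)
starts <- as.integer(m)            # -1 where an element has no match
lens   <- attr(m, "match.length")
hit    <- starts > 0               # elements of a that contain a match

# Extract the matched stretch from each element that matched
hits <- substr(a[hit], starts[hit], starts[hit] + lens[hit] - 1)
print(hits)
```

On these four strings the sketch should keep only a[1] and a[2].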

I have been fiddling around with negative lookahead but I really can't get my head around this. Any pointers would be greatly appreciated. Thanks a lot,
Stefan Th. Gries
University of California, Santa Barbara
