[R] Regex: workaround for variable length negative lookbehind

Stefan Th. Gries stgries at gmail.com
Sun Nov 30 20:33:21 CET 2008


Hi all

I have the following regular expression problem: I want to find
complete elements of a vector that end in a repeated character but
where the repetition doesn't make up the whole word. That is, for the
vector vec:

vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")

I would like to get
"baaa"
"bbaa"
"baamm"

From tools where negative lookbehind can involve variable lengths, one
would think this would work:

grep("(?<!(?:\\1|^))(.)\\1{1,}$", vec, perl=T)

But R doesn't accept it, since lookbehind there has to be of fixed
length. I also know I can get it in two steps like this:

whole.word.rep <- grep("^(.)\\1{1,}$", vec, perl=TRUE) # 1 6: one character repeated throughout
rep.at.end <- grep("(.)\\1{1,}$", vec, perl=TRUE) # 1 2 3 5 6: ends in a repeated character
setdiff(rep.at.end, whole.word.rep) # 2 3 5
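
The two-step workaround above can be wrapped up so that it returns the
strings themselves rather than their positions. This is only a minimal
sketch: the function name ends_in_partial_rep is made up for
illustration, and it is the same two-grep approach, not the
single-regex solution I am after.

```r
# Hypothetical helper (the name is not part of any package): returns the
# elements that end in a repeated character, excluding those that consist
# of one repeated character throughout.
ends_in_partial_rep <- function(x) {
  # positions where the whole element is one repeated character
  whole.word.rep <- grep("^(.)\\1{1,}$", x, perl = TRUE)
  # positions where the element ends in a repeated character
  rep.at.end <- grep("(.)\\1{1,}$", x, perl = TRUE)
  # keep the second set minus the first, then index back into x
  x[setdiff(rep.at.end, whole.word.rep)]
}

vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
ends_in_partial_rep(vec)
```

For the vector above this gives the three wanted elements, "baaa",
"bbaa" and "baamm".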

But is there a one-line grep thingy to do this?

Thx for any pointers,
STG


