[Rd] Named capture in regexp

Toby Dylan Hocking Toby.Hocking at inria.fr
Fri Feb 25 22:12:01 CET 2011


Dear R core developers,

One feature from Python that I have been wanting in R is the ability
to capture groups in regular expressions using names. Consider the
following example in R.

> notables <- c("  Ben Franklin and Jefferson Davis","\tMillard Fillmore")
> name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
> (parsed <- regexpr(name.rex,notables,perl=TRUE))
[1] 3 2
attr(,"match.length")
[1] 12 16
attr(,"capture.start")
     [,1] [,2]
[1,]    3    7
[2,]    2   10
attr(,"capture.length")
     [,1] [,2]
[1,]    3    8
[2,]    7    8
attr(,"capture.names")
[1] "first" "last" 
> parse.one(notables,parsed)
     first     last      
[1,] "Ben"     "Franklin"
[2,] "Millard" "Fillmore"
> parse.one(notables,parsed)[,"last"]
[1] "Franklin" "Fillmore"

The advantage of this approach is that you can tag groups by name, and
then use those names later in the code to extract the matched
substrings. (parse.one is a small helper function defined in the test
code attached below.)
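
A minimal sketch of looking up a single group by name, working directly
from the attributes shown above. get_group is a hypothetical helper, not
part of the patch. It assumes an R build in which regexpr(..., perl=TRUE)
returns the capture.start, capture.length, and capture.names attributes,
and it further assumes that elements with no match get a capture start
of -1, mirroring how regexpr itself reports a failed match.

```r
# Hypothetical helper (not part of the patch): extract one named group
# from the result of regexpr(..., perl = TRUE).
get_group <- function(x, m, name) {
  nms <- attr(m, "capture.names")
  j <- match(name, nms)
  if (is.na(j)) stop("no capture group named '", name, "'")
  st <- attr(m, "capture.start")[, j]
  len <- attr(m, "capture.length")[, j]
  # Assumption: a start of -1 marks an element with no match.
  ifelse(st > 0, substring(x, st, st + len - 1), NA_character_)
}

notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
m <- regexpr(name.rex, notables, perl = TRUE)
last.names <- get_group(notables, m, "last")
print(last.names)
```

Looking the column up through capture.names means the calling code does
not need to know the position of the group within the pattern.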

I realized this is possible using the PCRE library that ships with R,
so over the last couple of days I hacked a bit on src/main/grep.c in
the R source code. I managed to get named capture to work with the
standard gregexpr and regexpr functions. For backwards compatibility,
my strategy was simply to add more attributes to the results of these
functions, as shown above.
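
To illustrate the backwards-compatibility point, here is a short sketch
of code written against the existing interface. It only reads the match
positions and the match.length attribute, so any extra attributes on the
result should not affect it. The variable names are illustrative, and
the comparison with the named pattern assumes an R build that accepts
named groups with perl=TRUE.

```r
notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")

# Existing idiom: recover the whole match from start and match.length.
old.rex <- "([A-Z][a-z]+) ([A-Z][a-z]+)"
m <- regexpr(old.rex, notables, perl = TRUE)
start <- as.integer(m)  # as.integer() drops all attributes
len <- attr(m, "match.length")
whole <- substring(notables, start, start + len - 1)
print(whole)

# The same pattern with named groups matches at the same positions;
# only the additional attributes differ.
name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
m2 <- regexpr(name.rex, notables, perl = TRUE)
print(identical(as.integer(m), as.integer(m2)))
```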

Attached are the patch and some R code for testing the new features.
It works fine for me with no memory problems. However, I noticed that
there is some UTF-8 handling code, which I did not touch (use_UTF8 is
false on my machine). I presume we will need to make some small
modifications to get it to work with Unicode, but I'm not sure how to
make them.
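
As a hedged illustration of why the UTF-8 case may need separate care:
in a UTF-8 string, byte offsets and character offsets diverge after the
first multibyte character, so capture positions computed in bytes would
need converting before being used with substring. The sketch below only
demonstrates that divergence with existing R functions. The example
string is made up, and whether capture.start reports the character
offset depends on the implementation being tested.

```r
# "\u00c9" is a capital E with acute accent: 1 character, 2 bytes in UTF-8.
x <- "\u00c9mile Zola"
n.chars <- nchar(x, type = "chars")
n.bytes <- nchar(x, type = "bytes")

# Offset of "Zola" counted in characters and in bytes.
char.pos <- as.integer(regexpr("Zola", x, fixed = TRUE))
byte.pos <- as.integer(regexpr("Zola", x, fixed = TRUE, useBytes = TRUE))
print(c(chars = n.chars, bytes = n.bytes))
print(c(char.pos = char.pos, byte.pos = byte.pos))

# A check one might run against a named-capture implementation: the
# capture start should agree with char.pos, not byte.pos.
m <- regexpr("(?<last>Zola)", x, perl = TRUE)
print(attr(m, "capture.start"))
```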

Would you consider integrating this patch into the R source code for
future releases, so the larger R community can take advantage of this
feature? If there's anything else I can do to help, please let me know.

Sincerely,
Toby Dylan Hocking
http://cbio.ensmp.fr/~thocking/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: grep-named-capture.patch
Type: text/x-patch
Size: 9016 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20110225/819df4b2/attachment.bin>
-------------- next part --------------
### Toby Dylan Hocking, 25 feb 2011. Some R code to test my
### implementation of new named capture group features in gregexpr
gctorture(FALSE)##for debugging

### Parse result of gregexpr(,string)
result2list <- function(string,result){
  extract.substrings <- function(string,starts,lengths,names){
    subs <- substring(string,starts,starts+lengths-1)
    m <- matrix(subs,ncol=length(names))
    colnames(m) <- names
    m
  }
  N <- attr(result,"capture.names")
  lapply(seq_along(result),function(i){
    extract.substrings(string[i],
                       attr(result[[i]],"capture.start"),
                       attr(result[[i]],"capture.length"),
                       N)
  })
}

### Parse result of regexpr(,string)
parse.one <- function(string,result){
  m <- do.call(rbind,lapply(seq_along(string),function(i){
    st <- attr(result,"capture.start")[i,]
    substring(string[i],st,st+attr(result,"capture.length")[i,]-1)
  }))
  colnames(m) <- attr(result,"capture.names")
  m
}
string <- c("another foobar bazing string",
            "fbar baz foooobar baz",
            "no matches here",
            "my foobar baz st another fooooobar baz dude fobar baz")
notables <- c("  Ben Franklin and Jefferson Davis","\tMillard Fillmore")
name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
for(i in 1:100)
  parsed <- gregexpr(name.rex,notables,perl=TRUE)
parsed[[1]]
result2list(notables,parsed)
(parsed <- regexpr(name.rex,notables,perl=TRUE))
parse.one(notables,parsed)
parse.one(notables,parsed)[,"last"]
result <- gregexpr("f(?<os_in_foo>o*)b(?<as_in_bar>a*)r (baz)",string,perl=TRUE)
print(result)
s2 <- paste(rep("foobar",1030),collapse=" ")
result <- gregexpr("f(?<os_in_foo>o*)b(?<as_in_bar>a*)r",s2,perl=TRUE)

## negative controls
try(regexpr(name.rex,notables))##perl not TRUE, bad regexp; try() lets the script continue
regexpr("([A-Z][a-z]+) ([A-Z][a-z]+)",notables)##still works like usual

