[R] Regular expressions: offsets of groups

Michael Bedward michael.bedward at gmail.com
Wed Sep 29 02:42:55 CEST 2010


Ah, that's interesting - thanks Bill. That's certainly on the right
track for me (Titus, you too?), especially if the subpattern argument
accepted a vector of group indices.

As you say, this is straightforward in C. I'd be happy to (try to)
make a patch for the R sources if there were some consensus on the best
way to implement it, i.e. as a new R function or by extending existing
function(s).
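
For reference, here is a minimal sketch of the information POSIX
regexec() hands back for each parenthesized group. It is only an
illustration, not a proposed patch: group_offsets() and MAX_GROUPS are
hypothetical names, it assumes POSIX extended syntax via <regex.h>
rather than whatever regex engine R is built with, and it reports only
the first match (a gregexpr-style version would loop over the rest of
the string).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of R or S+): find the first match of the
 * POSIX extended regex `pattern` in `text` and report, for each of the
 * first `nmatch` entries, a 1-based start position and a length.
 * Entry 0 is the whole match; entry i is the i'th parenthesized group.
 * A group that took no part in the match gets start = -1, length = -1.
 *
 * Returns 0 on a match, 1 if regexec() reports no match (or fails),
 * and -1 if the pattern does not compile.
 */
int group_offsets(const char *pattern, const char *text,
                  size_t nmatch, int *start, int *length)
{
    enum { MAX_GROUPS = 16 };   /* arbitrary cap chosen for this sketch */
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int rc;
    size_t i;

    assert(nmatch >= 1 && nmatch <= MAX_GROUPS);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* regexec() fills m[0..nmatch-1] with byte offsets into text */
    rc = regexec(&re, text, nmatch, m, 0);
    regfree(&re);
    if (rc != 0)
        return 1;

    for (i = 0; i < nmatch; i++) {
        if (m[i].rm_so == -1) {
            /* group unused in this match */
            start[i] = -1;
            length[i] = -1;
        } else {
            /* convert 0-based offset to R-style 1-based position */
            start[i] = (int) m[i].rm_so + 1;
            length[i] = (int) (m[i].rm_eo - m[i].rm_so);
        }
    }
    return 0;
}
```

Called with pattern "a+(b+)", text "abcdaabbc" and nmatch = 2, this
fills in start 1, length 2 for the whole match and start 2, length 1
for group 1, which agrees with the first element of Bill's gregexpr()
output. Returning all groups at once is roughly what a vector-valued
subpattern argument would need.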

Michael

On 29 September 2010 01:46, William Dunlap wrote:
>
> S+ has a subpattern=number argument to regexpr and
> related functions.  It means that the text matched
> by the subpattern'th parenthesized expression in the
> pattern will be considered the matched text.  E.g.,
> to find runs of b's that come immediately after a's:
>
>  > gregexpr("a+(b+)", "abcdaabbc", subpattern=1)
>  [[1]]:
>  [1] 2 7
>  attr(, "match.length"):
>  [1] 1 2
>
> or to find bc's that come after 2 or more ab's
>  > gregexpr("(ab){2,}bc", "abbcabababbcabcababbc", subpattern=1)
>
> regexpr() and strsplit() have this argument in S+ 8.1 but
> gregexpr() is not yet in a released version of S+.
>
> subpattern=0, the default, means to use the entire
> pattern.  regexpr allows subpattern=-1, which means
> to return a list with one element for each subpattern.
> I don't know if the extra complexity is worth it.
> (gregexpr does not allow subpattern=-1.)
>
> The usual C regexec() returns this information.
> Perhaps it would be handy to have it in R.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>


