[R] Regular expressions: offsets of groups

Tue Sep 28 13:47:24 CEST 2010

On Tue, Sep 28, 2010 at 6:52 AM, Titus von der Malsburg
<malsburg at gmail.com> wrote:
> On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
> <michael.bedward at gmail.com> wrote:
>> What Titus wants to do is akin to retrieving capturing groups from a
>> Matcher object in Java.
>
> Precisely.  Here's the description:
>
>  http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)
>
> Gabor's lookbehind trick solves some special cases but it's not the

The only limitation is that in the regular expressions supported by R
you cannot have repetition in the (?<=...) portion, but none of your
examples -- neither the one you gave nor the one below -- requires that,
since if the prior expression ends in X+ you can just use X.  Are
you sure it does not cover all your actual situations?
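
As a minimal sketch of the lookbehind approach, assuming base R with
perl = TRUE (needed for lookbehind), and borrowing the pattern and input
string from the example further down. The variable names are only for
illustration:

```r
# Sketch: find each run of b's that is immediately preceded by "a".
# The lookbehind (?<=a) is fixed width, so no repetition is needed in it.
x <- "abcdaabbcbbb"
m <- gregexpr("(?<=a)b+", x, perl = TRUE)[[1]]

starts <- as.vector(m)                       # offsets in the original string
lens   <- as.vector(attr(m, "match.length")) # lengths of the matched runs

print(starts)  # 2 7
print(lens)    # 1 2
```

Because the "a" is only asserted and not consumed, the reported positions
are those of the b's themselves in the original string.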

If you truly do have situations that require repetition, gregexpr
plus gsubfn will do it in one line.  Parenthesize the portion of the
regular expression you want to capture and replace every character in
it with X (or some other character that does not otherwise occur).
Then find the positions and lengths of the runs of X.

> gregexpr("X+", gsubfn("a(b+)", ~ gsub(".", "X", x), "abcdaabbcbbb"))
[[1]]
[1] 1 5
attr(,"match.length")
[1] 1 2
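
Note that gsubfn replaces the whole match here, so the positions above
appear to index the rewritten string ("XcdaXXcbbb") rather than the
original one.  As a cross-check, here is a sketch in base R only, under
the assumption that your R version's gregexpr reports the
"capture.start" and "capture.length" attributes when perl = TRUE (older
versions may not).  The variable names are made up for the example:

```r
# Sketch: read group offsets directly from gregexpr's capture attributes.
# Assumes an R version where perl = TRUE attaches capture.start/capture.length.
x <- "abcdaabbcbbb"
m <- gregexpr("a(b+)", x, perl = TRUE)[[1]]

# One row per match, one column per capturing group; flatten for one group.
group_start  <- as.vector(attr(m, "capture.start"))
group_length <- as.vector(attr(m, "capture.length"))

print(group_start)   # 2 7
print(group_length)  # 1 2
```

These are offsets of the (b+) group in the original string, which is
what Matcher.start(int) gives you in Java, apart from R counting from 1.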

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com