[R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Sat Dec 13 00:08:23 CET 2008


Greg Snow wrote:
> Where do you get "should" and "expect" from?  All the regular expression tools that I am familiar with only match non-overlapping patterns unless you do extra to specify otherwise.  One of the standard references for regular expressions if you really want to understand what is going on is "Mastering Regular Expressions" by Jeffrey Friedl.  You should really read through that book before passing judgment on the correctness of an implementation.
>
> If you want the overlaps, you need to come up with a regular expression that will match without consuming all of the string.  Here is one way to do it with your example:
>
>  > gregexpr("1122(?=1122)", paste(rep("1122", 10), collapse=""), perl=TRUE)
> [[1]]
> [1]  1  5  9 13 17 21 25 29 33
> attr(,"match.length")
> [1] 4 4 4 4 4 4 4 4 4
>
>   

another option would be to move the anchor backwards after each match,
but i'm not sure if the problem really needs it and if it could be done
from within r.

greg (and another person who answered this post earlier):
while your frustration is understandable, i think reid (and possibly
other users as well) would benefit from a brief explanation instead of
your emotional reactions.  you ought to be more patient and less
arrogant with newbies who will often think there is a bug in r when
there isn't.

reid:
when matching is performed, there is a pointer moved through the
string.  in global matching, after a match is found the pointer is just
behind the matched substring, and further matching proceeds from there. 
for example example, suppose you match "aaa" (the string) with "aa" (the
pattern) globally.  after the first successful match, the position
pointer is *behind the second a* in the string, and no further match can
be found from there.in this context, 'global' does not mean that all
possible matches are found, rather that matching is performed iteratively.

the above is probably a solution to your problem, though the matches
have length 4, not 8.  in perl, you could manually move back the anchor
after each match, e.g.:

$string = "1122" x 10;
$n = length($string)/2;
@matches = ();
$string =~ /11221122(??{push @matches [$-[0], $&]; pos($s) -= $n})/g;

now @matches has 9 elements, each a ref to an array with the starting
position and the content (of length 8) of the respective match:
@matches = ([0, "11221122"], [4, "11221122"], ...)

not sure if you can do this within r.  not sure if you'll ever need it. 
for more complex cases when you need overlapping matches and you need
their content, greg's solution might not do, but in general that's the
solution.

vQ



More information about the R-help mailing list