[R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Sun Dec 14 01:26:35 CET 2008

Greg Snow wrote:
> Wacek,
>
>  I am curious as to why Brian and I (and possibly other responders) are held to a higher standard than the original poster.
>   
(we have just had an offline communication, it should be fine to keep it
that way)

> My first question was a real question.  There are 2 main ways to do regular expression matching (possibly others as well), you describe one method with the pointer moving through the string to be matched, the other way moves the pointer through the pattern looking for all possible matches in the string that match so far (one method is DFA the other NFA, but I don't remember which is which and my copy of Friedl is at work and I'm not).  Perl and PCRE use the method that you describe, but the other method may be more efficient for finding all overlapping matches in some cases, but as far as I know (and there are plenty of things I don't know, including all the changes/new programs since I last read about this) the only programs that use the other type of matching don't return the actual matches, just a yes/no on at least one match.  So if the original poster has formed his opinion based on another program that uses that type of matching, I would be interested to know about it.  It would teach me something and also make it easier for all of us to better help the original poster in the future if we understand where he is coming from.

that's right, there are the dfa and the nfa approaches (and more), and
the latter is the regex-driven approach of perl and many other
implementations.  they work differently, and also differ in what you can
do with them.

i was talking about updating the string pointer to the position where
the next match in a global (iterative) matching process can start. 
while the nfa approach moves, in general, back and forth through the
string, and the dfa approach steps through the string linearly keeping a
list of possible matches, the end result is the same:  if there is a
match, the next match will not start earlier than after the end of the
current match.  what a dfa and an nfa engine will actually match given a
text and a pattern may differ:

perl -e 'print "ab" =~ /a|ab/g'
# a

echo ab | egrep -o 'a|ab'
# ab

but none of them will report overlapping matches:

perl -e 'print "aaaaa" =~ /aa/g
# 2 matches

echo aaaaa | egrep -o aa
# 2 matches

(afaik, egrep uses a posix-compliant dfa engine) 

to achieve the effect of overlapping matches, you either need to
manually move the pointer while the global match proceeds, or
sequentially perform single matches on successive substrings of the
input string (which can give you the same match more than once,
though).  it appears that my earlier suggestion was flawed, the
following is a bit cleaner:

$string = # some string
$pattern = # some pattern

@matches = ();
while ($string =~ /$pattern/g) { push @matches, [$-[0], $&];
pos($string) -= (length($&) - 1) }

after each successful match, it moves to the position right after the
start of the successful match.

the following will capture all possible matches for all alternatives in
the regex (or so it seems):

$string =~/(?:$pattern)(??{ push @matches, [$-[0], $&] })/g

so that "aabb" =~ /(?:a|abb?)(??{ push @matches, [$-[0], $&] })/g will
give 4 matches.

again, not sure if this can be done within r.

vQ