[R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Sun Dec 14 07:21:15 CET 2008

Controlling the pointer is going to be very different from perl since the R functions are vectorized rather than focusing on a single string.

Here is one approach that will give all the matches and lengths (for the original problem at least):

> mystr <- paste(rep("1122", 10), collapse="")
> n <- nchar(mystr)
>
> mystr2 <- substr(rep(mystr,n), 1:n, n)
>
> tmp <- regexpr("^11221122", mystr2)
> (tmp + 1:n - 1)[tmp>0]
[1]  1  5  9 13 17 21 25 29 33
> attr(tmp,"match.length")[tmp>0]
[1] 8 8 8 8 8 8 8 8 8

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
> Sent: Saturday, December 13, 2008 5:27 PM
> To: Greg Snow
> Cc: R help
> Subject: Re: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Greg Snow wrote:
> > Wacek,
> >
> >  I am curious as to why Brian and I (and possibly other responders)
> are held to a higher standard than the original poster.
> >
> (we have just had an offline communication, it should be fine to keep
> it
> that way)
>
> > My first question was a real question.  There are 2 main ways to do
> regular expression matching (possibly others as well), you describe one
> method with the pointer moving through the string to be matched, the
> other way moves the pointer through the pattern looking for all
> possible matches in the string that match so far (one method is DFA the
> other NFA, but I don't remember which is which and my copy of Friedl is
> at work and I'm not).  Perl and PCRE use the method that you describe,
> but the other method may be more efficient for finding all overlapping
> matches in some cases, but as far as I know (and there are plenty of
> things I don't know, including all the changes/new programs since I
> last read about this) the only programs that use the other type of
> matching don't return the actual matches, just a yes/no on at least one
> match.  So if the original poster has formed his opinion based on
> another program that uses that type of matching, I would be interested
> to know about it.  It would teach me something and also make it easier
> for all of us to better help the original poster in the future if we
> understand where he is coming from.
>
> that's right, there are the dfa and the nfa approaches (and more), and
> the latter is the regex-driven approach of perl and many other
> implementations.  they work differently, and also differ in what you
> can
> do with them.
>
> i was talking about updating the string pointer to the position where
> the next match in a global (iterative) matching process can start.
> while the nfa approach moves, in general, back and forth through the
> string, and the dfa approach steps through the string linearly keeping
> a
> list of possible matches, the end result is the same:  if there is a
> match, the next match will not start earlier than after the end of the
> current match.  what a dfa and an nfa engine will actually match given
> a
> text and a pattern may differ:
>
> perl -e 'print "ab" =~ /a|ab/g'
> # a
>
> echo ab | egrep -o 'a|ab'
> # ab
>
> but none of them will report overlapping matches:
>
> perl -e 'print "aaaaa" =~ /aa/g
> # 2 matches
>
> echo aaaaa | egrep -o aa
> # 2 matches
>
> (afaik, egrep uses a posix-compliant dfa engine)
>
> to achieve the effect of overlapping matches, you either need to
> manually move the pointer while the global match proceeds, or
> sequentially perform single matches on successive substrings of the
> input string (which can give you the same match more than once,
> though).  it appears that my earlier suggestion was flawed, the
> following is a bit cleaner:
>
> $string = # some string
> $pattern = # some pattern
>
> @matches = ();
> while ($string =~ /$pattern/g) { push @matches, [$-[0], $&];
> pos($string) -= (length($&) - 1) }
>
> after each successful match, it moves to the position right after the
> start of the successful match.
>
>
> the following will capture all possible matches for all alternatives in
> the regex (or so it seems):
>
> $string =~/(?:$pattern)(??{ push @matches, [$-[0], $&] })/g
>
> so that "aabb" =~ /(?:a|abb?)(??{ push @matches, [$-[0], $&] })/g will
> give 4 matches.
>
> again, not sure if this can be done within r.
>
> vQ
>
>