[R] Minimal match to regexp?

Duncan Murdoch murdoch.duncan at gmail.com
Thu Jan 26 11:12:33 CET 2023


I'll submit a bug report.
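
Until that is resolved, here is a sketch of a possible workaround, based on the perl = TRUE observation quoted below. The helper name extract_html_block, the (?s) flag, and the four-backtick and multi-line inputs are my own illustrative assumptions, not code from the thread. The (?s) flag is there because, with perl = TRUE, "." does not match a newline by default, so without it the pattern would only handle one-line block bodies.

```r
# Pattern from the thread, prefixed with (?s) so that "." also matches
# newlines under PCRE (an assumption added for multi-line block bodies).
pattern2 <- "(?s)\n([`]{3,})html\n.*?\n\\1\n"

# Hypothetical helper: return the first html block in x, matched with PCRE.
extract_html_block <- function(x) {
  regmatches(x, regexpr(pattern2, x, perl = TRUE))
}

# Three-backtick input from the thread.
x3 <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

# Illustrative four-backtick variant of the same input.
x4 <- "\n````html\nblah blah \n````\n\n```r\nblah blah\n```\n"

# Illustrative input whose html block spans two lines.
x_multi <- "\n```html\nline one\nline two\n```\n\n```r\nblah blah\n```\n"

m3 <- extract_html_block(x3)
m4 <- extract_html_block(x4)
m_multi <- extract_html_block(x_multi)

cat(m3)
cat(m4)
cat(m_multi)
```

In each case only the html block should be printed, with the closing marker using the same number of backticks as the opening one.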

On 25/01/2023 8:38 p.m., Andrew Simmons wrote:
> It seems like a bug to me. Using perl = TRUE, I see the desired result:
> 
> ```
> x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"
> 
> pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
> 
> cat(regmatches(x, regexpr(pattern2, x, perl = TRUE)))
> ```
> 
> If you change it to something like:
> 
> ```
> x <- c(
>      "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n",
>      "\n```html\nblah blah \n```\n"
> )
> 
> pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
> 
> print(regmatches(x, regexpr(pattern2, x)), width = 10)
> ```
> 
> you can see that it does find the match, so the combination of *? and
> \\1 must be messing up regexpr(). They seem to work perfectly fine on
> their own.
> 
> On Wed, Jan 25, 2023 at 7:57 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>
>> Thanks for pointing out my mistake.  I oversimplified the real problem.
>>
>> I'll try to post a version of it that comes closer:  Suppose I have a
>> string like this:
>>
>> x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"
>>
>> If I cat() it, I see that it is really markdown source:
>>
>>     ```html
>>     blah blah
>>     ```
>>
>>     ```r
>>     blah blah
>>     ```
>>
>> I want to find the part that includes the html block, but not the r
>> block.  So I want to match "```html", followed by a minimal number of
>> characters, then "```".  Then this pattern works:
>>
>>     pattern <- "\n```html\n.*?\n```\n"
>>
>> and we get the right answer:
>>
>>     cat(regmatches(x, regexpr(pattern, x)))
>>
>>     ```html
>>     blah blah
>>     ```
>>
>> Okay, but this flavour of markdown says there can be more backticks, not
>> just 3.  So the block might look like
>>
>>     ````html
>>     blah blah
>>     ````
>>
>> I need to have the same number of backticks in the opening and closing
>> marker.  So I make the pattern more complicated, and it doesn't work:
>>
>>     pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
>>
>> This matches all of x:
>>
>>     > pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
>>     > cat(regmatches(x, regexpr(pattern2, x)))
>>
>>     ```html
>>     blah blah
>>     ```
>>
>>     ```r
>>     blah blah
>>     ```
>>
>>
>> Is that a bug, or am I making a silly mistake again?
>>
>> Duncan Murdoch
>>
>>
>>
>> On 25/01/2023 7:34 p.m., Andrew Simmons wrote:
>>> grep(value = TRUE) just returns the strings which match the pattern. You
>>> have to use regexpr() or gregexpr() if you want to know where the
>>> matches are:
>>>
>>> ```
>>> x <- "abaca"
>>>
>>> # extract only the first match with regexpr()
>>> m <- regexpr("a.*?a", x)
>>> regmatches(x, m)
>>>
>>> # or
>>>
>>> # extract every match with gregexpr()
>>> m <- gregexpr("a.*?a", x)
>>> regmatches(x, m)
>>> ```
>>>
>>> You could also use sub() to remove the rest of the string:
>>> `sub("^.*(a.*?a).*$", "\\1", x)`
>>> keeping only the match within the parentheses.
>>>
>>>
>>> On Wed, Jan 25, 2023, 19:19 Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>>
>>>      The docs for ?regexp say this:  "By default repetition is greedy, so
>>>      the maximal possible number of repeats is used. This can be changed to
>>>      ‘minimal’ by appending ? to the quantifier. (There are further
>>>      quantifiers that allow approximate matching: see the TRE
>>>      documentation.)"
>>>
>>>      I want the minimal match, but I don't seem to be getting it.  For
>>>      example,
>>>
>>>      x <- "abaca"
>>>      grep("a.*?a", x, value = TRUE)
>>>      #> [1] "abaca"
>>>
>>>      Shouldn't I have gotten "aba", which is the first match to "a.*a"?  If
>>>      not, what would be the regexp that would give me the first match to
>>>      "a.*a", without greedy expansion of the .*?
>>>
>>>      Duncan Murdoch
>>>
>>>
>>


