[Rd] 'gsub' not perl compatible?

Robert McGehee rmcgehee at walleyetrading.net
Mon Jul 24 22:51:56 CEST 2017


Hi Jean-Luc,
FWIW, you're pointing out a common discrepancy between regex parsers: how the parser advances after a zero-length match, and in particular whether it accepts a zero-length match at the position where a non-zero-length match has just ended.

I think this article is especially helpful for understanding the nuances here, particularly the section "Advancing After a Zero-Length Regex Match". 
http://www.regular-expressions.info/zerolength.html

In that article, the test example is gsub("\\d*", "x", "x1"), which demonstrates the same difference as your example (i.e. the answer can be either "xxx" or "xxxx" depending on the parser). They also include a note specifically about R's gsub function that describes this discrepancy:

"The regexp functions in R and PHP are based on PCRE, so they avoid getting stuck on a zero-length match by backtracking like PCRE does. But the gsub() function to search-and-replace in R also skips zero-length matches at the position where the previous non-zero-length match ended, like Python does."
	
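A minimal sketch of the two advancing strategies, written in plain R without calling any regex engine. The helper `replace_optional` is hypothetical (it is not part of R) and hand-codes only the greedy pattern "<ch>?" for a single literal character; it is meant to illustrate where the extra replacement comes from, not to describe how gsub or PCRE are implemented.

```r
# Hypothetical helper: replace every match of the greedy pattern "<ch>?" in s.
# skip_adjacent_empty = FALSE keeps the zero-length match that directly follows
#   a non-empty match (the result reported in this thread for Perl/PCRE/SAS).
# skip_adjacent_empty = TRUE drops it (the result reported for R's gsub).
replace_optional <- function(s, ch, repl, skip_adjacent_empty) {
  chars <- strsplit(s, "")[[1]]
  n <- length(chars)
  out <- character(0)
  pos <- 1L                    # 1-based index of the next character to examine
  after_nonempty <- FALSE      # did a non-empty match end exactly at pos?
  while (pos <= n + 1L) {      # n + 1 is the end-of-string position
    if (pos <= n && chars[pos] == ch) {
      # Non-empty match: consume the character and emit the replacement.
      out <- c(out, repl)
      pos <- pos + 1L
      after_nonempty <- TRUE
    } else {
      # Zero-length match at pos: emit the replacement unless it is skipped,
      # then copy the current character (if any) and advance by one.
      if (!(skip_adjacent_empty && after_nonempty)) out <- c(out, repl)
      if (pos <= n) out <- c(out, chars[pos])
      pos <- pos + 1L
      after_nonempty <- FALSE
    }
  }
  paste(out, collapse = "")
}

print(replace_optional("abc", "b", "!", skip_adjacent_empty = FALSE))
# [1] "!a!!c!"
print(replace_optional("abc", "b", "!", skip_adjacent_empty = TRUE))
# [1] "!a!c!"
```

The only difference between the two runs is the zero-length match between "b" and "c", which is the fourth match in the first case and is skipped in the second.
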
All that said, your larger point still seems valid: we should expect behavior consistent with the PCRE parser when we specify perl=TRUE, even if that gives a different answer than R's default TRE parser does with perl=FALSE. And to take Perl out of the equation, I also verified your test directly with PCRE (8.39) on my Linux box using the `pcretest` command, and sure enough, pcretest shows four matches for your example, consistent with an answer of !a!!c! like you said.

Perhaps at a minimum, the ?gsub or ?regex man page should add a blurb indicating that the perl=TRUE behavior differs from PCRE when a zero-length match is adjacent to a non-zero-length match, though I'm not sure whether this difference is known and intentional or just a side effect of some other decision.

R also supports Perl-style options embedded in the pattern. For example, '(?i)' makes the pattern case-insensitive and '(?U)' turns off greedy matching. I could imagine having the behavior you noted depend on such an option as well, if someone was inclined to make a patch and didn't want to change existing behavior.
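For reference, a small sketch of the two embedded options mentioned here. The input strings and the variable names `ci` and `lazy` are made up for illustration; only '(?i)' and '(?U)' come from the message.

```r
# (?i): case-insensitive matching, so the pattern "B" also matches "b".
ci <- gsub("(?i)B", "!", "abc", perl = TRUE)
print(ci)
# [1] "a!c"

# (?U): quantifiers are lazy by default, so "a+" stops after a single "a".
# sub() replaces only the first match, which makes the effect visible.
lazy <- sub("(?U)a+", "!", "aaa", perl = TRUE)
print(lazy)
# [1] "!aa"
```
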

However, to get the result you want today, it seems you may unfortunately have to use two calls to gsub, something like the following. Doubling each "b" first leaves a second "b" to be replaced at the spot where gsub would otherwise skip the zero-length match:

> gsub("b?", "!", gsub("b", "bb", "abc"))
 [1] "!a!!c!"

--Robert


-----Original Message-----
From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Lipatz Jean-Luc
Sent: Friday, July 21, 2017 5:27 AM
To: r-devel at r-project.org
Subject: [Rd] 'gsub' not perl compatible?

Hi all,

Working on some SAS program conversions, I was testing this (3.4.0 Windows, but also 2.10.1 MacOsX):
gsub("b?","!","abc",perl=T)

which returns
[1] "!a!c!"

that I didn't understand.

Unfortunately, when asked for the same thing, SAS 9.4 replies "!a!!c!", and so does Perl (Strawberry 5.26), which seems a more logical answer to me.
Is there some problem with PCRE, or some subtlety that I didn't catch?

Results are similar with * instead of ?, and there is a similar issue with the lazy operator:
gsub("b??","!","abc",perl=T) gives "!a!b!c!", while the other software gives "!a!!!c!"


Thanks

Jean-Luc LIPATZ




