[R] Regex Split?

Fri May 5 21:25:15 CEST 2023

Dear Bill,

Indeed, there are other cases as well - as documented.

Various regex sites warn against the legacy syntax "[[:<:]]". The
alternative syntax below shows the same behaviour:
strsplit(split="\\b(?=\\w)", "One, two; three!", perl=TRUE)
# "O"  "n"  "e"  ", " "t"  "w"  "o"  "; " "t"  "h"  "r"  "e"  "e" "!"

gsub("\\b(?=\\w)", "#", "One, two; three!", perl=TRUE)
# "#One, #two; #three!"
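
A possible workaround, until strsplit() handles these patterns, is to take
the zero-length match positions from gregexpr() and cut the string with
substring(). This is only a minimal sketch: split_at() is a hypothetical
helper name (not a base R function), and it assumes a single input string
and a pattern that produces only zero-length matches.

```r
# Hypothetical helper, not part of base R: split one string at the
# positions of zero-length PCRE matches reported by gregexpr().
split_at <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  pos <- as.integer(m)
  # Drop the "no match" marker (-1) and a match at position 1,
  # which would only produce a leading empty piece.
  pos <- pos[pos > 1L]
  starts <- c(1L, pos)
  ends <- c(pos - 1L, nchar(x))
  substring(x, starts, ends)
}

split_at("One, two; three!", "\\b(?=\\w)")
# "One, "  "two; "  "three!"

split_at("split here --><-- split here", "(?=<--)|(?<=-->)")
# "split here -->" "<-- split here"
```

Because the separators here have zero width, nothing is removed from the
string; a pattern that consumes characters would need extra handling.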

Sincerely,

Leonard

On 5/5/2023 6:19 PM, Bill Dunlap wrote:
> https://eu01.z.antigena.com/l/BgIBOxsm88PwDTBiTTrQ784MFk2oGZVOA3RMHiarAZuyoEemKrcnpfJeD8X0FgxRDG33qHZho~NriRCbhv9_Ffr3EOfqn2vpaNUAlCDjQ8nOyVUgPM2iGnHi-qpN54kl1YVO_gHimn0m2ZJ68ntGtysras~0mRMDuAgwbTXsQcQ~ 
> (from 2016, still labelled "UNCONFIRMED") contains some other examples 
> of strsplit misbehaving when using 0-length perl look-behinds.  E.g.,
>
> > strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
>  [1] "O"  "n"  "e"  ", " "t"  "w"  "o"  "; " "t"  "h"  "r"  "e"  "e"  "!"
> > gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
> [1] "#One, #two; #three!"
>
> The bug report includes the comment
> It may be possible that strsplit is not using the startoffset argument
> to pcre_exec
>
>    pcre/pcre/doc/html/pcreapi.html
>      A non-zero starting offset is useful when searching for another match
>      in the same subject by calling pcre_exec() again after a previous
>      success. Setting startoffset differs from just passing over a
>      shortened string and setting PCRE_NOTBOL in the case of a pattern that
>      begins with any kind of lookbehind.
>
> or it could be something else.
>
>
> On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r00t using gmail.com> wrote:
>
>     On Thu, 4 May 2023 23:59:33 +0300
>     Leonard Mada via R-help <r-help using r-project.org> wrote:
>
>     > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
>     > # "a"    "bc"   ","    "def"  ","    ""     "adef" "," "," "gh"
>     >
>     > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
>     > # "a"    "bc"   ","    "def"  ","    ""     "adef" "," "," "gh"
>     >
>     > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
>     > # "a"    "bc"   ","    "def"  ","    ""     "adef" "," "," "gh"
>     >
>     >
>     > Is this correct?
>
>     Perl seems to return the results you expect:
>
>     $ perl -E '
>      say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
>      for (
>       qr[ |(?=,)|(?<=,)(?![ ])],
>       qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
>       qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
>     )'
>     (?^u: |(?=,)|(?<=,)(?![ ])):
>      "a" "bc" "," "def" "," "adef" "," "," "gh"
>     (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
>      "a" "bc" "," "def" "," "adef" "," "," "gh"
>     (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
>      "a" "bc" "," "def" "," "adef" "," "," "gh"
>
>     The same thing happens when I ask R to replace the separators instead
>     of splitting by them:
>
>     sapply(setNames(nm = c(
>      " |(?=,)|(?<=,)(?![ ])",
>      " |(?<! )(?=,)|(?<=,)(?![ ])",
>      " |(?<! )(?=,)|(?<=,)(?=[^ ])")
>     ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
>     #               |(?=,)|(?<=,)(?![ ])
>     # "a[]bc[],[]def[],[]adef[],[],[]gh"
>     #         |(?<! )(?=,)|(?<=,)(?![ ])
>     # "a[]bc[],[]def[],[]adef[],[],[]gh"
>     #        |(?<! )(?=,)|(?<=,)(?=[^ ])
>     # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
>     I think that something strange happens when the delimiter pattern
>     matches more than once in the same place:
>
>     gsub(
>      '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
>      perl = TRUE
>     )
>     # [1] "split here -->[]<-- split here"
>
>     (Both Perl's split() and s///g agree with R's gsub() here, although I
>     would have accepted "split here -->[][]<-- split here" too.)
>
>     On the other hand, the following doesn't look right:
>
>     strsplit(
>      'split here --><-- split here', '(?=<--)|(?<=-->)',
>      perl = TRUE
>     )
>     # [[1]]
>     # [1] "split here -->" "<"              "-- split here"
>
>     The "<" is definitely not followed by "<--", and the rightmost "--" is
>     definitely not preceded by "-->".
>
>     Perhaps strsplit() incorrectly advances the match position after one
>     match?
>
>     -- 
>     Best regards,
>     Ivan
>