[R] Regex Split?

Bill Dunlap
Fri May 5 17:19:21 CEST 2023


https://bugs.r-project.org/show_bug.cgi?id=16745 (from 2016, still labelled
"UNCONFIRMED") contains some other examples of strsplit misbehaving when
using zero-length Perl look-behinds.  E.g.,

> strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
 [1] "O"  "n"  "e"  ", " "t"  "w"  "o"  "; " "t"  "h"  "r"  "e"  "e"  "!"
> gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
[1] "#One, #two; #three!"
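
Since gsub() puts the markers where one would expect, a possible workaround
is to let gsub() insert a sentinel and then split on that sentinel as a fixed
string. This is only a sketch, not something taken from the bug report; the
sentinel "\001" is an assumption (it must not occur in the input).

```r
x <- "One, two; three!"

# Assumed sentinel: a control character that does not appear in x.
sentinel <- "\001"

# Mark every zero-length match with the sentinel (gsub finds these correctly).
marked <- gsub("[[:<:]]", sentinel, x, perl = TRUE)

# Split on the literal sentinel; no regex, so no look-behind is involved.
parts <- strsplit(marked, sentinel, fixed = TRUE)[[1]]
parts
# [1] ""       "One, "  "two; "  "three!"
```

The leading "" is there because the first marker sits at the very start of
the string; drop it with parts[nzchar(parts)] if it is unwanted.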

The bug report includes the following comment:

It may be possible that strsplit is not using the startoffset argument
to pcre_exec

  pcre/pcre/doc/html/pcreapi.html
    A non-zero starting offset is useful when searching for another match
    in the same subject by calling pcre_exec() again after a previous
    success. Setting startoffset differs from just passing over a
    shortened string and setting PCRE_NOTBOL in the case of a pattern that
    begins with any kind of lookbehind.

or it could be something else.
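
One way to probe that hypothesis is to emulate a splitter that re-runs the
regex on the shortened remainder of the string after each match, instead of
passing a start offset. The sketch below is only an assumption about what
might be going on, not a copy of R's implementation; naive_split() is a
made-up helper and is only meant for ASCII input.

```r
# Hypothetical helper (not part of R): split by repeatedly matching the
# pattern against the *remaining* string, so look-behinds cannot see
# the text that has already been consumed.
naive_split <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)              # 1-based start, -1 if no match
    if (start == -1L) {
      out <- c(out, x)                  # no separator left: keep the rest
      break
    }
    # Last matched character; equals start - 1 for a zero-length match.
    end <- start + attr(m, "match.length") - 1L
    if (end >= 1L) {
      # Emit the text left of the match, then drop it and the match.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Zero-length match at the very start: emit one character, move on.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

naive_split("One, two; three!", "[[:<:]]")
naive_split("split here --><-- split here", "(?=<--)|(?<=-->)")
```

On these inputs the sketch gives the same pieces as the strsplit() calls in
this thread (the "One, two; three!" result above and Ivan's "split here"
result quoted below), which is at least consistent with the start-offset
explanation.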



On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r00t using gmail.com> wrote:

> On Thu, 4 May 2023 23:59:33 +0300
> Leonard Mada via R-help <r-help using r-project.org> wrote:
>
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> > perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> >
> > Is this correct?
>
> Perl seems to return the results you expect:
>
> $ perl -E '
>  say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
>  for (
>   qr[ |(?=,)|(?<=,)(?![ ])],
>   qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
>   qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
> )'
> (?^u: |(?=,)|(?<=,)(?![ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> The same thing happens when I ask R to replace the separators instead
> of splitting by them:
>
> sapply(setNames(nm = c(
>  " |(?=,)|(?<=,)(?![ ])",
>  " |(?<! )(?=,)|(?<=,)(?![ ])",
>  " |(?<! )(?=,)|(?<=,)(?=[^ ])")
> ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
> #               |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
> #        |(?<! )(?=,)|(?<=,)(?=[^ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
> I think that something strange happens when the delimiter pattern
> matches more than once in the same place:
>
> gsub(
>  '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
>  perl = TRUE
> )
> # [1] "split here -->[]<-- split here"
>
> (Both Perl's split() and s///g agree with R's gsub() here, although I
> would have accepted "split here -->[][]<-- split here" too.)
>
> On the other hand, the following doesn't look right:
>
> strsplit(
>  'split here --><-- split here', '(?=<--)|(?<=-->)',
>  perl = TRUE
> )
> # [[1]]
> # [1] "split here -->" "<"              "-- split here"
>
> The "<" is definitely not followed by "<--", and the rightmost "--" is
> definitely not preceded by "-->".
>
> Perhaps strsplit() incorrectly advances the match position after one
> match?
>
> --
> Best regards,
> Ivan