[R] Regex Split?
    Ivan Krylov 
    kry|ov@r00t @end|ng |rom gm@||@com
       
    Fri May  5 12:24:36 CEST 2023
    
    
  
On Thu, 4 May 2023 23:59:33 +0300
Leonard Mada via R-help <r-help using r-project.org> wrote:
> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> 
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> 
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> 
> 
> Is this correct?

Perl seems to return the results you expect:
$ perl -E '
 say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
 for (
  qr[ |(?=,)|(?<=,)(?![ ])],
  qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
  qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
)'
(?^u: |(?=,)|(?<=,)(?![ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"
The same thing happens when I ask R to replace the separators instead
of splitting by them:
sapply(setNames(nm = c(
 " |(?=,)|(?<=,)(?![ ])",
 " |(?<! )(?=,)|(?<=,)(?![ ])",
 " |(?<! )(?=,)|(?<=,)(?=[^ ])")
), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
#               |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ]) 
# "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh" 
#        |(?<! )(?=,)|(?<=,)(?=[^ ]) 
# "a[]bc[],[]def[],[]adef[],[],[]gh" 
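Since gsub() puts the separators where they are expected, one possible
workaround is to substitute a sentinel for every separator and then
split on that sentinel as a fixed string. This is only a minimal sketch:
split_via_gsub is a hypothetical helper name, and it assumes that the
sentinel character "\001" never occurs in the input.

```r
# Hypothetical helper: mark the separators with gsub(), then split on the mark.
# Assumption: the sentinel character "\001" never occurs in the input.
split_via_gsub <- function(x, pattern, sentinel = "\001") {
  marked <- gsub(pattern, sentinel, x, perl = TRUE)
  strsplit(marked, sentinel, fixed = TRUE)
}

split_via_gsub("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])")[[1]]
# "a"    "bc"   ","    "def"  ","    "adef" ","    ","    "gh"
```

The second step uses fixed = TRUE, so no lookaround assertions are
involved in the actual splitting.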
I think that something strange happens when the delimiter pattern
matches more than once in the same place:
gsub(
 '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
 perl = TRUE
)
# [1] "split here -->[]<-- split here"
(Both Perl's split() and s///g agree with R's gsub() here, although I
would have accepted "split here -->[][]<-- split here" too.)
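To look at the match positions without replacing or splitting anything,
gregexpr() can list them. A small sketch; I would expect it to report a
single zero-length match between "-->" and "<--", which would be
consistent with the single "[]" above:

```r
x <- 'split here --><-- split here'
# All matches of the pattern in x, searched with the whole string visible
m <- gregexpr('(?=<--)|(?<=-->)', x, perl = TRUE)[[1]]
# One row per match: 1-based start position and match length
data.frame(start = as.integer(m), length = attr(m, "match.length"))
```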
On the other hand, the following doesn't look right:
strsplit(
 'split here --><-- split here', '(?=<--)|(?<=-->)',
 perl = TRUE
)
# [[1]]
# [1] "split here -->" "<"              "-- split here"
The "<" is definitely not followed by "<--", and the rightmost "--" is
definitely not preceded by "-->".
Perhaps strsplit() incorrectly advances the match position after one
match?
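One way to probe this guess is to emulate the loop described in
?strsplit: take the text to the left of the first match as a token,
drop everything up to the end of the match, and search the remainder
from scratch. The sketch below is my assumption about the behaviour,
not the actual C implementation. In particular, taking a single
character as the token when the match is empty and sits at the very
start of the remainder is an assumption, and split_sketch is a
hypothetical name.

```r
# Hypothetical emulation of the loop described in ?strsplit (an assumption,
# not the real implementation).
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    pos <- as.integer(m)                # 1-based start of match, -1 if none
    len <- attr(m, "match.length")
    if (pos == -1L) {                   # no match: the rest is the last token
      out <- c(out, x)
      break
    }
    end <- pos + len - 1L               # last character covered by the match
    if (end > 0L) {
      # Token is the text left of the match; drop it and the match itself
      out <- c(out, substr(x, 1L, pos - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Assumed special case: empty match at the very start,
      # so take one character as the token
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_sketch('split here --><-- split here', '(?=<--)|(?<=-->)')
# "split here -->" "<"              "-- split here"
```

If strsplit() works like this, the lookbehind (?<=-->) can no longer see
the "-->" that has already been cut off, and the empty match of (?=<--)
at the start of the remainder produces the one-character token "<".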
-- 
Best regards,
Ivan
    
    