[R] Split String in regex while Keeping Delimiter

Wed Apr 12 19:47:58 CEST 2023

On Wed, 12 Apr 2023 08:29:50 +0000
Emily Bakker <emilybakker using outlook.com> wrote:

> Some example data:
> “leucocyten + gramnegatieve staven +++ grampositieve staven ++”
> “leucocyten – grampositieve coccen +”
>  
> I want to split the strings such that I get the following result:
> c(“leucocyten +”,  “gramnegatieve staven +++”,
>  “grampositieve staven ++”)
> c(“leucocyten –“, “grampositieve coccen +”)
>  
> I have tried strsplit with a regular expression with a positive
> lookahead, but I am not able to achieve the results that I want.

It sounds like you need positive look-behind, not look-ahead: split on
spaces only if they _follow_ one to three of '+' or '-'. Unfortunately,
variable-length repetition quantifiers like {n,m} or + are not directly
supported inside look-behind expressions (neither in PCRE, which R uses
for perl = TRUE, nor in Perl itself). As a workaround, you can use \K:
everything matched to the left of \K is dropped from the reported
match, so it behaves like a variable-length positive look-behind:

x <- c(
 'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
 'leucocyten - grampositieve coccen +'
)
strsplit(x, '[+-]{1,3}+\\K ', perl = TRUE)
# [[1]]
# [1] "leucocyten +"             "gramnegatieve staven +++"
#     "grampositieve staven ++" 
# 
# [[2]]
# [1] "leucocyten -"           "grampositieve coccen +"
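
For this particular data, a fixed-length look-behind may be enough as
well: a space that directly follows a single '+' or '-' is already a
split point, no matter how many signs come before it. A minimal sketch,
assuming that '+' and '-' never occur immediately before a space
anywhere else in the strings (the name res is only for illustration):

```r
x <- c(
 'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
 'leucocyten - grampositieve coccen +'
)
# Split on a space only when the character right before it is '+' or '-'.
# The look-behind is exactly one character long, which PCRE accepts.
res <- strsplit(x, '(?<=[+-]) ', perl = TRUE)
res
```

This avoids \K, at the price of not checking that there are at most
three signs in a row.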

-- 
Best regards,
Ivan

P.S. It looks like your e-mail client has transformed every quote
character into typographically-correct Unicode quotes “” and every
minus into an en dash, which makes it slightly harder to work with your
code, since typographically correct Unicode quotes are not R string
delimiters. Is it really – that you'd like to split upon, or is it -?
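
If the data really does contain en dashes, one option is to replace
them with ASCII minus signs before splitting. A rough sketch, assuming
the dash is U+2013 and that no other en dashes need to be preserved
(y, y_ascii and res2 are made-up names):

```r
# '\u2013' is the en dash; the example string is a guess at the raw data
y <- 'leucocyten \u2013 grampositieve coccen +'
# Replace every en dash with an ASCII minus before splitting
y_ascii <- gsub('\u2013', '-', y, fixed = TRUE)
res2 <- strsplit(y_ascii, '[+-]{1,3}+\\K ', perl = TRUE)
res2
```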