[R] Split String in regex while Keeping Delimiter

Thu Apr 13 18:09:03 CEST 2023

Since any space that follows two or three + signs (or - signs) also
follows a single + (or -), this can be done with a positive look-behind
on just one character, which may be a little simpler:

x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)
strsplit(x, "(?<=[+-])\\s+", perl=TRUE)
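
As a quick check, here is a small sketch that compares the look-behind
split with the result asked for in the original post. The names parts
and expected are made up for this illustration, and expected is typed
with plain ASCII quotes and hyphens rather than the typographic
characters in the quoted message:

```r
x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)

# Split on whitespace only where the preceding character is + or -.
parts <- strsplit(x, "(?<=[+-])\\s+", perl = TRUE)

# The desired result from the original question (hypothetical name).
expected <- list(
  c("leucocyten +", "gramnegatieve staven +++", "grampositieve staven ++"),
  c("leucocyten -", "grampositieve coccen +")
)

identical(parts, expected)
```

Spaces inside a piece (such as the one in "gramnegatieve staven") are
left alone because they follow a letter, not a + or -.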

An alternative is the strapply family of functions in the gsubfn
package, which focus on what you want to keep in each piece rather
than on what to split on.

Here is an example that says to keep a sequence of characters that are
not + or -, followed by 1 to 3 + or - characters:

library(gsubfn)
strapplyc(x, "[^+-]+[+-]{1,3}")

This keeps the leading space on every returned string after the
first. A couple of options that drop these spaces as well are:

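# capture the piece, match trailing spaces outside the group,
# and return only the captured group
strapply(x, "([^+-]+[+-]{1,3}) *", backref = -1)
# require each piece to start with a character that is not a space, + or -
strapply(x, "[^ +-][^+-]+[+-]{1,3}")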
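
If gsubfn is not available, roughly the same "say what to keep" idea
can be sketched in base R with gregexpr() and regmatches(). This is a
substitute for strapplyc, not a description of how gsubfn works
internally, and the name pieces is made up for the illustration:

```r
x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)

# gregexpr() locates every run of non-+/- characters followed by one to
# three + or - signs; regmatches() extracts the matched text, giving one
# character vector per input string.
pieces <- regmatches(x, gregexpr("[^+-]+[+-]{1,3}", x))

# Every piece after the first still starts with a space, so trim it.
pieces <- lapply(pieces, trimws)
pieces
```

Trimming afterwards keeps the pattern itself simple. The alternative is
to build the space handling into the regular expression, as in the two
strapply calls above.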

On Wed, Apr 12, 2023 at 11:54 AM Ivan Krylov <krylov.r00t using gmail.com> wrote:
>
> On Wed, 12 Apr 2023 08:29:50 +0000
> Emily Bakker <emilybakker using outlook.com> wrote:
>
> > Some example data:
> > “leucocyten + gramnegatieve staven +++ grampositieve staven ++”
> > “leucocyten – grampositieve coccen +”
> >
> > I want to split the strings such that I get the following result:
> > c(“leucocyten +”,  “gramnegatieve staven +++”,
> >  “grampositieve staven ++”)
> > c(“leucocyten –“, “grampositieve coccen +”)
> >
> > I have tried strsplit with a regular expression with a positive
> > lookahead, but I am not able to achieve the results that I want.
>
> It sounds like you need positive look-behind, not look-ahead: split on
> spaces only if they _follow_ one to three of '+' or '-'. Unfortunately,
> repetition quantifiers like {n,m} or + are not directly supported in
> look-behind expressions (nor in Perl itself). As a special case, you
> can use \K, where anything to the left of \K is a zero-width positive
> match:
>
> x <- c(
>  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
>  'leucocyten - grampositieve coccen +'
> )
> strsplit(x, '[+-]{1,3}+\\K ', perl = TRUE)
> # [[1]]
> # [1] "leucocyten +"             "gramnegatieve staven +++"
> #     "grampositieve staven ++"
> #
> # [[2]]
> # [1] "leucocyten -"           "grampositieve coccen +"
>
> --
> Best regards,
> Ivan
>
> P.S. It looks like your e-mail client has transformed every quote
> character into typographically-correct Unicode quotes “” and every
> minus into an en dash, which makes it slightly harder to work with your
> code, since typographically correct Unicode quotes are not R string
> delimiters. Is it really – that you'd like to split upon, or is it -?

-- 
Gregory (Greg) L. Snow Ph.D.
538280 using gmail.com