[R] Split String in regex while Keeping Delimiter

Thu Apr 13 01:19:32 CEST 2023

I always find regex puzzles amusing, so after changing the Unicode
typographic quotes and dashes to ASCII, the following simple
prescription, similar to those proffered by others, seems to produce
what you requested with your example:

x <- c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
 "leucocyten - grampositieve coccen +")

strsplit(gsub("([^[:alnum:]]) ", "\\1>>", x), ">>")

(You can call unlist() on the result if you wish.)
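
For reference, here is a minimal sketch of what the call above returns on the two example strings, with and without unlist(). The names `res` and `flat` are illustrative only and not part of the original suggestion.

```r
x <- c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
       "leucocyten - grampositieve coccen +")

# Mark each space that follows a non-alphanumeric character with ">>",
# then split on that marker; the delimiter stays with the preceding piece.
res <- strsplit(gsub("([^[:alnum:]]) ", "\\1>>", x), ">>")
str(res)      # a list with one character vector per input string

# Flatten to a single character vector if the grouping is not needed.
flat <- unlist(res)
length(flat)  # 5 pieces in total for this example
```

Keeping the list form preserves which pieces came from which input string, which is probably what you want if each string is one record.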

My slight variant uses a character class and a backreference to mark
where you want to split. The '>>' marker substituted for the space is
purely arbitrary, of course; any string that does not occur in the
data will do. Instead of [^[:alnum:]] you could probably use [+-], but
I only wished to assume some sort of non-alphanumeric character.
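
To illustrate the difference between the two character classes, here is a small sketch. The input `y` is made up for illustration and does not come from the original data, and `broad` and `narrow` are hypothetical names. It assumes the only delimiters of interest are "+" and "-".

```r
# Hypothetical input containing a comma, made up for illustration only.
y <- "coccen, rood + staven ++"

# Broad class: the comma is non-alphanumeric too, so ", " becomes a split point.
broad <- strsplit(gsub("([^[:alnum:]]) ", "\\1>>", y), ">>")[[1]]

# Narrow class: only a space directly after "+" or "-" becomes a split point.
narrow <- strsplit(gsub("([+-]) ", "\\1>>", y), ">>")[[1]]

print(broad)
print(narrow)

# The marker must not already occur in the data, or it yields spurious splits.
stopifnot(!any(grepl(">>", y, fixed = TRUE)))
```

Which class is preferable depends on whether other punctuation can appear in the real data; the narrow class is the safer choice if it can.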

I mention in passing that the above also seemed to work when I kept
the en dash instead of a minus sign, but I make no claim for
superiority -- or even noninferiority -- to the solutions proposed by
others.
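
A quick way to check the en dash case yourself, sketched under the assumption of a UTF-8 session. The names `z` and `pieces` are illustrative, and the en dash is written as the escape "\u2013" so the code itself stays plain ASCII.

```r
# "\u2013" is the en dash that appeared in the original post.
z <- "leucocyten \u2013 grampositieve coccen +"

# Same prescription as before: the en dash is non-alphanumeric,
# so the space after it is replaced by the ">>" marker.
pieces <- strsplit(gsub("([^[:alnum:]]) ", "\\1>>", z), ">>")[[1]]
print(pieces)

# The first piece should end in the en dash rather than an ASCII minus.
endsWith(pieces[1], "\u2013")
```

Note that the narrower class [+-] would not match the en dash, so it would need to be added to the class explicitly if the data really contain it.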

Cheers,
Bert

On Wed, Apr 12, 2023 at 2:52 PM David Winsemius <dwinsemius using comcast.net> wrote:
>
> I thought replacing the spaces following instances of +++,++,+,- with "\n" and then reading with scan should succeed. Like Ivan Krylov I was fairly sure that you meant the minus sign to be "-" rather than "–", but perhaps you were using MS Word as an editor which is inconsistent with effective use of R. If so, learn to use a proper programming editor, and in any case learn to post to rhelp in plain text.
>
> --
> David
>
> scan(text=gsub("([-+]){1}\\s", "\\1\n", dat), what="", sep="\n")
>
>
>
> > On Apr 12, 2023, at 2:29 AM, Emily Bakker <emilybakker using outlook.com> wrote:
> >
> > Hello List,
> >
> > I have a dataset consisting of strings that I want to split while saving the delimiter.
> >
> > Some example data:
> > “leucocyten + gramnegatieve staven +++ grampositieve staven ++”
> > “leucocyten – grampositieve coccen +”
> >
> > I want to split the strings such that I get the following result:
> > c(“leucocyten +”,  “gramnegatieve staven +++”,  “grampositieve staven ++”)
> > c(“leucocyten –“, “grampositieve coccen +”)
> >
> > I have tried strsplit with a regular expression with a positive lookahead, but I am not able to achieve the results that I want.
> >
> > I have tried:
> > as.list(strsplit(x, split = “(?=[\\+-]{1,3}\\s)+, perl=TRUE)
> >
> > Which results in:
> > c(“leucocyten “, “+”,  “gramnegatieve staven “, “+”, “+”, “+”,  “grampositieve staven ++”)
> > c(“leucocyten “, “–“, “grampositieve coccen +”)
> >
> >
> > Is there a function or regular expression that will make this possible?
> >
> > Kind regards,
> > Emily
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.