[R] Regex Split?

Bert Gunter
Sat May 6 00:35:24 CEST 2023


Primarily for my own amusement, here is a way to do what I think you wanted
without lookaheads or lookbehinds:

strsplit(gsub("([[:punct:]])"," \\1 ","a bc,def, adef,x; ,,gh"), " +")
[[1]]
 [1] "a"    "bc"   ","    "def"  ","    "adef" ","    "x"    ";"
[10] ","    ","    "gh"

I certainly would *not* claim that it is in any way superior to anything
that has already been suggested -- indeed, probably the contrary. But it's
simple (as am I).
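
For what it's worth, the one-liner can be unpacked into steps. This is only a
sketch: the helper name split_punct and the variable names are made up for
illustration, and the trimws() call is an addition to the one-liner (without
it, a string that starts with a punctuation mark would yield a leading ""
element). It assumes a single input string.

```r
# Hypothetical helper (the name is illustrative, not from any package):
# split on spaces while keeping each punctuation mark as its own token.
# Assumes x is a single character string.
split_punct <- function(x) {
  # Step 1: surround every punctuation character with spaces.
  padded <- gsub("([[:punct:]])", " \\1 ", x)
  # Step 2 (an addition to the one-liner): drop leading/trailing whitespace
  # so that a string starting with punctuation does not yield a leading "".
  padded <- trimws(padded)
  # Step 3: split on runs of one or more spaces.
  strsplit(padded, " +")[[1]]
}

tokens <- split_punct("a bc,def, adef ,,gh")
print(tokens)
```

On Leonard's original string this gives the nine tokens he listed as
expected: "a" "bc" "," "def" "," "adef" "," "," "gh".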

Cheers,
Bert

On Fri, May 5, 2023 at 2:54 PM Leonard Mada via R-help <r-help using r-project.org>
wrote:

> Dear Avi,
>
> Punctuation marks are used in various NLP language models. Preserving
> the "," is therefore useful in such scenarios, and regular expressions are
> a good way to accomplish this (especially if you have sufficient experience
> with them).
>
> I only observed an odd behaviour with strsplit. The example string is
> contrived, but it is always wise to test a regular expression against
> various scenarios. It is usually hard to predict what special cases will
> occur in a specific corpus.
>
> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"  "bc"  ","  "def"  ","  ""  "adef"  ","  ","  "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?=,)|(?<=,)(?![ ])")
> # "a"    "bc"   ","    "def"  ","    "adef"  ""     ","    "," "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?<! )(?=,)|(?<=,)(?![ ])")
> # "a"    "bc"   ","    "def"  ","    "adef"  ","    ","    "gh"
>
> # Expected:
> # "a"  "bc"   ","  "def"   ","  "adef"  ","   ","  "gh"
> # see 2nd instance of stringi::stri_split
>
>
> Sincerely,
>
>
> Leonard
>
>
> On 5/5/2023 11:20 PM, avi.e.gross using gmail.com wrote:
> > Leonard,
> >
> > It can be helpful to spell out your intent in English, or some of us have
> > to go back to the documentation to remember what some of the operators do.
> >
> > Your text being searched seems to be an example of items between commas,
> > with an optional space after some commas and, in one case, nothing between
> > commas.
> >
> > So what is your goal for the example, and in general? At the end you
> > mention, a bit unclearly, some of what you expect; I think it would be
> > clearer if you also showed exactly the output you want.
> >
> > I saw some other replies that addressed what you wanted and am going to
> > reply in another direction.
> >
> > Why do things the hard way using things like lookahead or lookbehind?
> > Would several steps get you the result far more clearly?
> >
> > For the sake of argument, you either want what reading in a CSV file
> > would supply, or something else. Since you are not simply splitting on
> > commas, it sounds like something else. But what else, exactly? Something as
> > simple as this split on just a comma produces results including empty
> > strings and embedded leading or trailing spaces:
> >
> > strsplit("a bc,def, adef ,,gh", ",")
> > [[1]]
> > [1] "a bc"   "def"    " adef " ""       "gh"
> >
> > That can of course be handled by, for example, unlisting the list that
> > strsplit returns and then trimming the result:
> >
> > library("stringr")
> > str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))
> >
> > [1] "a bc" "def"  "adef" ""     "gh"
> >
> > Now do you want the empty string to be something else, such as an NA?
> > That can be done too with another step.
> >
> > And a completely different variant can be used to read in your one-line
> > CSV as text using standard overkill tools:
> >
> >> read.table(text="a bc,def, adef ,,gh", sep=",")
> >      V1  V2     V3 V4 V5
> > 1 a bc def  adef  NA gh
> >
> > The above is a one-row data frame. But if you simply want to reassemble
> > your initial string cleaned up a bit, you can use paste to put back the
> > commas, as in a variation of the earlier example:
> >
> >> paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))), collapse=",")
> > [1] "a bc,def,adef,,gh"
> >
> > So my question is whether using advanced methods is really necessary for
> > your case, or even particularly efficient. If efficiency matters, it is
> > often better to use tools that avoid regular expressions, such as paste0(),
> > when they meet your needs.
> >
> > Of course, unless I know what you are actually trying to do, my remarks
> > may not be useful.
> >
> >
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Leonard Mada via R-help
> > Sent: Thursday, May 4, 2023 5:00 PM
> > To: R-help Mailing List <r-help using r-project.org>
> > Subject: [R] Regex Split?
> >
> > Dear R-Users,
> >
> > I tried the following three regular expressions in R 4.3:
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> >
> > Is this correct?
> >
> >
> > I feel that:
> > - none should return (after "def"): ",", "";
> > - the first one could also return "", "," (but probably not; not fully
> > sure about this);
> >
> >
> > Sincerely,
> >
> >
> > Leonard
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
