[R] Regex Split?

Sat May 6 01:05:56 CEST 2023

Dear Bert,

Thank you for the suggestion. Indeed, there are various solutions and 
workarounds. However, there is still a bug in strsplit.
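
To make the discrepancy easy to reproduce, here is a minimal sketch in base R. It runs the first pattern from my original post and then, as one possible workaround, extracts the tokens with gregexpr() and regmatches() instead of splitting on the separators. The token pattern "[^ ,]+|," is my own assumption: it was not proposed earlier in this thread and it is only meant for this constructed example.

```r
x <- "a bc,def, adef ,,gh"

# Splitting with look-arounds: in R 4.3 this yields an extra "" after the 2nd ","
res_split <- strsplit(x, " |(?=,)|(?<=,)(?![ ])", perl = TRUE)[[1]]

# Workaround sketch: match the tokens instead of the separators.
# A token is a run of characters that are neither space nor comma,
# or a single comma.
res_match <- regmatches(x, gregexpr("[^ ,]+|,", x))[[1]]

print(res_split)
print(res_match)
# res_match: "a" "bc" "," "def" "," "adef" "," "," "gh"
```

Because nothing is split, no empty strings can appear between tokens, which sidesteps the behaviour in question rather than fixing it.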

2.) gsub
I would try to avoid gsub on a Wikipedia-sized corpus: using strsplit 
directly should be far more efficient.
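
The efficiency claim is easy to check rather than assume. The following is only a rough timing sketch: the replicated string is a hypothetical stand-in for a real corpus, the variable names are mine, and I make no claim about which variant wins on a given machine. Note also that the one-pass result still contains the spurious "", so the two results are not identical.

```r
x <- "a bc,def, adef ,,gh"
corpus <- rep(x, 10000)  # hypothetical stand-in for a large corpus

# Two passes: gsub pads each punctuation mark with spaces,
# then strsplit splits on runs of spaces
t_two_pass <- system.time(
  two_pass <- strsplit(gsub("([[:punct:]])", " \\1 ", corpus), " +")
)

# One pass: strsplit with the look-around pattern from this thread
t_one_pass <- system.time(
  one_pass <- strsplit(corpus, " |(?=,)|(?<=,)(?![ ])", perl = TRUE)
)

# Elapsed seconds for each variant
print(c(two_pass = t_two_pass[["elapsed"]],
        one_pass = t_one_pass[["elapsed"]]))
```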

3.) Punctuation marks
Abbreviations and hyphenated words such as "word1-word2" may be a problem:
gsub("(?<ThePunct>[[:punct:]])", "\\1 ", "A.B.C.", perl=T)
# "A. B. C. "

I do not yet have an intuition whether the spaces in "A. B. C. " would 
adversely affect the language model. But this is getting off-topic.
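
As an illustration of the problem, here is a small sketch that contrasts the padding approach with a tokenizer that keeps such words together. The pattern token_re is purely hypothetical: it treats alphanumeric runs joined by "." or "-" as one token and any other punctuation mark as a token of its own. It cannot tell the final period of an abbreviation from the end of a sentence, so that period is still split off.

```r
x <- "A.B.C. and word1-word2, ok"

# Padding every punctuation mark breaks up abbreviations and hyphenated words
padded <- gsub("([[:punct:]])", "\\1 ", x)

# Hypothetical token pattern: alphanumeric runs joined by "." or "-"
# stay together; any remaining punctuation mark is its own token
token_re <- "(?:[[:alnum:]]+[.-])*[[:alnum:]]+|[[:punct:]]"
tokens <- regmatches(x, gregexpr(token_re, x, perl = TRUE))[[1]]

print(padded)
print(tokens)
# tokens: "A.B.C" "." "and" "word1-word2" "," "ok"
```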

Sincerely,

Leonard

On 5/6/2023 1:35 AM, Bert Gunter wrote:
> Primarily for my own amusement, here is a way to do what I think you 
> wanted without look-aheads/behinds
>
> strsplit(gsub("([[:punct:]])"," \\1 ","a bc,def, adef,x; ,,gh"), " +")
> [[1]]
>  [1] "a"    "bc"   ","    "def"  ","    "adef" ","    "x"  ";"
> [10] ","    ","    "gh"
>
> I certainly would *not* claim that it is in any way superior to 
> anything that has already been suggested -- indeed, probably the 
> contrary. But it's simple (as am I).
>
> Cheers,
> Bert
>
> On Fri, May 5, 2023 at 2:54 PM Leonard Mada via R-help 
> <r-help using r-project.org> wrote:
>
>     Dear Avi,
>
>     Punctuation marks are used in various NLP language models. Preserving
>     the "," is therefore useful in such scenarios, and regular expressions
>     are a good way to accomplish this (especially if you have sufficient
>     experience with them).
>
>     I only observed an odd behaviour when using strsplit. The example
>     string is constructed, but it is always wise to test a regular
>     expression against various scenarios: it is usually hard to predict
>     which special cases will occur in a specific corpus.
>
>     strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
>     # "a"  "bc"  ","  "def"  ","  ""  "adef"  ","  ","  "gh"
>
>     stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?=,)|(?<=,)(?![ ])")
>     # "a"    "bc"   ","    "def"  ","    "adef"  ""     ","    ","    "gh"
>
>     stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?<! )(?=,)|(?<=,)(?![ ])")
>     # "a"    "bc"   ","    "def"  ","    "adef"  ","    ","    "gh"
>
>     # Expected:
>     # "a"  "bc"   ","  "def"   ","  "adef"  ","   ","  "gh"
>     # see 2nd instance of stringi::stri_split
>
>
>     Sincerely,
>
>
>     Leonard
>
>
>     On 5/5/2023 11:20 PM, avi.e.gross using gmail.com wrote:
>     > Leonard,
>     >
>     > It can be helpful to spell out your intent in English, or some of
>     > us have to go back to the documentation to remember what some of
>     > the operators do.
>     >
>     > Your text being searched seems to be an example of items between
>     > commas, with an optional space after some commas and, in one case,
>     > nothing between commas.
>     >
>     > So what is your goal for the example, and in general? At the end
>     > you mention, a bit unclearly, some of what you expect; I think it
>     > would be clearer if you also showed exactly the output you want.
>     >
>     > I saw some other replies that addressed what you wanted, so I am
>     > going to reply in another direction.
>     >
>     > Why do things the hard way using things like lookahead or
>     > lookbehind? Would several steps get you the result more clearly?
>     >
>     > For the sake of argument, you either want what reading in a CSV
>     > file would supply, or something else. Since you are not simply
>     > splitting on commas, it sounds like something else. But what
>     > exactly? Something as simple as this split on just a comma produces
>     > results including empty strings and embedded leading or trailing
>     > spaces:
>     >
>     > strsplit("a bc,def, adef ,,gh", ",")
>     > [[1]]
>     > [1] "a bc"   "def"    " adef " ""       "gh"
>     >
>     > That can of course be handled by, for example, trimming the
>     > result after unlisting the somewhat odd list that strsplit returns:
>     >
>     > library("stringr")
>     > str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))
>     >
>     > [1] "a bc" "def"  "adef" ""     "gh"
>     >
>     > Now do you want the empty string to be something else, such as
>     > an NA? That can be done too with another step.
>     >
>     > And a completely different variant can be used to read in your
>     > one-line CSV as text using standard overkill tools:
>     >
>     >> read.table(text="a bc,def, adef ,,gh", sep=",")
>     >      V1  V2     V3 V4 V5
>     > 1 a bc def  adef  NA gh
>     >
>     > The above is a vector of texts. But if you simply want to
>     > reassemble your initial string cleaned up a bit, you can use paste
>     > to put back commas, as in a variation of the earlier example:
>     >
>     >> paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))),
>     >        collapse=",")
>     > [1] "a bc,def,adef,,gh"
>     >
>     > So my question is whether using advanced methods is really
>     > necessary for your case, or even particularly efficient. If
>     > efficiency matters, it is often better to use tools without
>     > regular expressions, such as paste0(), when they meet your needs.
>     >
>     > Of course, unless I know what you are actually trying to do, my
>     > remarks may not be useful.
>     >
>     >
>     >
>     > -----Original Message-----
>     > From: R-help <r-help-bounces using r-project.org> On Behalf Of Leonard Mada via R-help
>     > Sent: Thursday, May 4, 2023 5:00 PM
>     > To: R-help Mailing List <r-help using r-project.org>
>     > Subject: [R] Regex Split?
>     >
>     > Dear R-Users,
>     >
>     > I tried the following 3 Regex expressions in R 4.3:
>     > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
>     > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    ","    "gh"
>     >
>     > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
>     > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    ","    "gh"
>     >
>     > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
>     > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    ","    "gh"
>     >
>     >
>     > Is this correct?
>     >
>     >
>     > I feel that:
>     > - none should return (after "def"): ",", "";
>     > - the first one could also return "", "," (but probably not; I am
>     >   not fully sure about this).
>     >
>     >
>     > Sincerely,
>     >
>     >
>     > Leonard
>     >
>     > ______________________________________________
>     > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>     > https://stat.ethz.ch/mailman/listinfo/r-help
>     > PLEASE do read the posting guide
>     > http://www.R-project.org/posting-guide.html
>     > and provide commented, minimal, self-contained, reproducible code.
>