[R] regexp mystery

Tue Oct 16 11:23:54 CEST 2018

Hi

Thanks a lot for your insightful answer. I will need to study it in detail; gregexpr and regexpr seem to be quite handy for what I need.

Cheers
Petr

> -----Original Message-----
> From: Ivan Krylov <krylov.r00t using gmail.com>
> Sent: Tuesday, October 16, 2018 11:08 AM
> To: PIKAL Petr <petr.pikal using precheza.cz>
> Cc: r-help using r-project.org
> Subject: Re: [R] regexp mystery
>
> On Tue, 16 Oct 2018 08:36:27 +0000
> PIKAL Petr <petr.pikal using precheza.cz> wrote:
>
> > > dput(x[11])
> > "et odYezko: 3                     \fas odYezku:   15 s"
>
> > gsub("^.*: (\\d+).*$", "\\1", x[11])
> > works for 3
>
> This regular expression only matches one space between the colon and the
> number, but you have more than one of them before "15".
>
> > gsub("^.*[^:] (\\d+).*$", "\\1", x[11]) works for 15
>
> The match succeeds because a space is not a colon:
>
>  ^.* matches "et odYezko: 3                     \fas odYezku: "
>  [^:] matches the second space after the colon
>  the literal space " " matches the third space
>  finally, (\\d+) matches "15"
>  and .*$ matches " s"
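
To make the walk-through above easy to check, here is a minimal sketch that runs both patterns on the quoted string. The variable names (s, first_num, last_num, spaced_num) are only illustrative, and the third pattern is an added variant that was not part of the original exchange: it allows any number of spaces after the colon, and because ^.* is greedy it ends up at the last colon.

```r
# The string from dput(x[11]) quoted above; "\f" is a form feed character.
s <- "et odYezko: 3                     \fas odYezku:   15 s"

# Exactly one space between the colon and the digits:
# only the first colon qualifies, so the capture is "3".
first_num <- gsub("^.*: (\\d+).*$", "\\1", s)

# A non-colon character plus one space before the digits:
# the greedy ^.* backtracks until this fits in front of "15".
last_num <- gsub("^.*[^:] (\\d+).*$", "\\1", s)

# Illustrative variant: zero or more spaces after the colon.
# The greedy ^.* runs to the last colon, so the capture is "15".
spaced_num <- gsub("^.*: *(\\d+).*$", "\\1", s)

print(c(first_num, last_num, spaced_num))
```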
>
> If you need just the numbers, you might have more success by extracting
> matches directly with gregexpr and regmatches:
>
> (
>   function(s) regmatches(
>     s,
>     gregexpr("\\d+(\\.\\d+)?", s)
>   )
> )("et odYezko: 3                     \fas odYezku:   15 s")
>
> [[1]]
> [1] "3"  "15"
>
> (I'm creating an anonymous function and evaluating it immediately because I
> need to pass the same string to both gregexpr and regmatches.)
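
An alternative to the anonymous function is to store the strings in a variable first, which also makes it easy to process a whole vector and convert the results to numbers. This is only a sketch: the second element of x is a made-up example string, and x and nums are illustrative names.

```r
# First element is the string from the thread; the second one is hypothetical.
x <- c(
  "et odYezko: 3                     \fas odYezku:   15 s",
  "weight: 2.160 kg"
)

# gregexpr finds every number in every element,
# regmatches pulls out the matched substrings as a list of character vectors.
found <- regmatches(x, gregexpr("\\d+(\\.\\d+)?", x))

# Convert each character vector to numeric.
nums <- lapply(found, as.numeric)
print(nums)
```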
>
> If you need to capture numbers appearing in a specific context, a better regular
> expression suiting your needs might be
>
> ":\\s*(\\d+(?:\\.\\d+)?)"
>
> (A colon, followed by optional whitespace, followed by a number to capture:
> one or more digits, optionally followed by a dot and more digits, with the
> optional part in a non-capturing group)
>
> but I couldn't find a way to extract captures from repeated matches using
> vanilla R pattern matching (it's either regexec, which returns captures for the
> first match only, or gregexpr, which returns all matches but without the
> captures). If you can load the stringr package, it's very easy, though:
>
> str_match_all(
> c(
> "PYedehYev:  300 s              Záva~í: 2.160 kg",
> "et odYezko: 3               \fas odYezku:   15 s"
> ),
> ":\\s*(\\d+(?:\\.\\d+)?)"
> )
> [[1]]
>      [,1]      [,2]
> [1,] ":  300"  "300"
> [2,] ": 2.160" "2.160"
>
> [[2]]
>      [,1]     [,2]
> [1,] ": 3"    "3"
> [2,] ":   15" "15"
>
> Column 2 of each list item contains the requested captures.
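
If loading stringr is not an option, one possible base R workaround for the limitation mentioned above is a two-step approach: collect all matches with gregexpr, then run regexec on each matched substring to recover its capture group. This is a sketch, not something proposed in the thread. It assumes an R version where regexec accepts perl = TRUE, the second input string is a made-up ASCII example, and the names x, pattern, all_matches and captures are illustrative.

```r
# First element is from the thread; the second one is hypothetical.
x <- c(
  "et odYezko: 3               \fas odYezku:   15 s",
  "weight: 2.160 kg, time:  300 s"
)
pattern <- ":\\s*(\\d+(?:\\.\\d+)?)"

# Step 1: all full matches per input string (no capture groups yet).
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Step 2: re-match each full match on its own; regexec returns the whole
# match as element 1 and the capture group as element 2.
captures <- lapply(all_matches, function(m) {
  parts <- regmatches(m, regexec(pattern, m, perl = TRUE))
  vapply(parts, function(p) p[2], character(1))
})
print(captures)
```

Each element of captures corresponds to column 2 of the matching str_match_all result.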
>
> --
> Best regards,
> Ivan