[R] regex

Tue Sep 17 16:42:33 CEST 2019

(For the units)

Why not simply:

sub(".*\\[(.+)\\]","\\1", headers)
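
A minimal sketch of that call, assuming the headers from the original post are stored in a character vector named headers (units.var is just the name reused from that post):

```r
# Headers as given in the original post
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")

# ".*\\[" consumes everything up to and including the last "[",
# "(.+)" captures the text inside the brackets, and "\\]" matches the
# closing "]". The whole match is then replaced by the captured group "\\1".
units.var <- sub(".*\\[(.+)\\]", "\\1", headers)
print(units.var)
# [1] "mm"     "mm"     "degree" "degree" "degree"
```

Note that sub() returns a string unchanged when the pattern does not match, so a header without a bracketed unit would come back as is rather than as NA or "".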

Cheers,
Bert

On Tue, Sep 17, 2019 at 6:40 AM Ivan Calandra <calandra using rgzm.de> wrote:

> Thank you Ivan for your help!
>
> Your solution for the first problem is so simple I didn't even think
> about it!
> What I find weird is that "_w_|\\.csv$" works as expected ("OR"), but is
> there no way to combine two patterns with an "AND"?
>
> Your solution to the second problem is actually unfortunately even more
> complicated to me than the gsub() solution. But I'm glad I can learn
> about regmatches() and regexpr()!
>
> Best,
> Ivan
>
> --
> Dr. Ivan Calandra
> TraCEr, laboratory for Traceology and Controlled Experiments
> MONREPOS Archaeological Research Centre and
> Museum for Human Behavioural Evolution
> Schloss Monrepos
> 56567 Neuwied, Germany
> +49 (0) 2631 9772-243
> https://www.researchgate.net/profile/Ivan_Calandra
>
> On 17/09/2019 09:14, Ivan Krylov wrote:
> > On Tue, 17 Sep 2019 08:48:43 +0200
> > Ivan Calandra <calandra using rgzm.de> wrote:
> >
> >> CSVs <- list.files(path=..., pattern="\\.csv$")
> >> w.files <- CSVs[grep(pattern="_w_", CSVs)]
> >>
> >> Of course, what I would like to do is list only the interesting files
> >> from the beginning, rather than subsetting the whole list of files.
> > One way to express that would be "_w_.*\\.csv$", meaning that the
> > filename has to have "_w_" in it, followed by anything (any character
> > repeated any number of times, including 0), followed by ".csv" at the
> > end of the line.
> >
> >> 2) The units of the variables are given in the original headers. I
> >> would like to extract the units. This is what I did:
> >>
> >> headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
> >>              "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
> >> units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)
> >>
> >> It seems to me to be overly complicated using gsub(). Isn't there a way
> >> to extract what is interesting rather than deleting what is not?
> > Pure-R way: use regmatches() + regexpr(). Both regmatches and regexpr
> > take the character vector as an argument, so duplication is hard to
> > avoid:
> >
> > units <- regmatches(headers, regexpr('\\[.*\\]', headers))
> >
> > The stringr package has an str_match() function with a nicer interface:
> > str_match(headers, '\\[.*\\]') -> units.
> >
> > Such "greedy" patterns containing ".*" present a few pitfalls, e.g.
> > looking for text in parentheses using the pattern "\\(.*\\)" in
> > "...(abc)...(def)..." will match the whole "(abc)...(def)" instead of
> > single groups "(abc)" and "(def)", but with your examples the pattern
> > should work as presented. One other option would be to ask for "[",
> > followed by zero or more characters that are not "]", followed by "]":
> > '\\[[^]]*\\]'.
> >