[R] regex

Ivan Krylov
Tue Sep 17 09:14:24 CEST 2019


On Tue, 17 Sep 2019 08:48:43 +0200
Ivan Calandra <calandra using rgzm.de> wrote:

> CSVs <- list.files(path=..., pattern="\\.csv$") 
> w.files <- CSVs[grep(pattern="_w_", CSVs)]
> 
> Of course, what I would like to do is list only the interesting files 
> from the beginning, rather than subsetting the whole list of files.

One way to express that would be "_w_.*\\.csv$", meaning that the
filename has to have "_w_" in it, followed by anything (any character
repeated any number of times, including 0), followed by ".csv" at the
end of the file name.
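
As a minimal sketch of how this could look in a single list.files()
call (the temporary directory and the file names below are made up
purely for illustration):

```r
# Hypothetical demo files in a scratch directory (names are assumptions)
demo.dir <- file.path(tempdir(), "regex_demo")
dir.create(demo.dir, showWarnings = FALSE)
file.create(file.path(demo.dir,
                      c("a_w_1.csv", "a_x_1.csv", "b_w_2.csv", "b_w_2.txt")))

# One call: the name must contain "_w_" and end in ".csv"
w.files <- list.files(path = demo.dir, pattern = "_w_.*\\.csv$")
print(w.files)
```

Only the two names that contain "_w_" and end in ".csv" should be
listed; "a_x_1.csv" and "b_w_2.txt" are filtered out.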

> 2) The units of the variables are given in the original headers. I
> would like to extract the units. This is what I did:
>
> headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
>              "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
> units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)
> 
> It seems to me to be overly complicated using gsub(). Isn't there a way
> to extract what is interesting rather than deleting what is not?

Base R way: use regmatches() + regexpr(). Both regmatches() and
regexpr() take the character vector as an argument, so passing it
twice is hard to avoid:

units <- regmatches(headers, regexpr('\\[.*\\]', headers))
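
For reference, a self-contained version using the headers from the
question. The last step, which strips the surrounding brackets, is an
extra not present in the line above, and units.bare is just an
illustrative name:

```r
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")

# Extract the bracketed part of each header (brackets included)
units <- regmatches(headers, regexpr("\\[.*\\]", headers))
print(units)

# Optional extra step: drop the leading "[" and the trailing "]"
units.bare <- gsub("^\\[|\\]$", "", units)
print(units.bare)
```

Note that regmatches() with regexpr() silently drops elements that do
not match, so the result can be shorter than the input if some header
has no unit in brackets.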

The stringr package has an str_extract() function with a nicer
interface: units <- str_extract(headers, '\\[.*\\]'). (str_match() also
works, but it returns a character matrix rather than a vector.)

Such "greedy" patterns containing ".*" present a few pitfalls, e.g.
looking for text in parentheses using the pattern "\\(.*\\)" in
"...(abc)...(def)..." will match the whole "(abc)...(def)" instead of
single groups "(abc)" and "(def)", but with your examples the pattern
should work as presented. One other option would be to ask for "[",
followed by zero or more characters that are not "]", followed by "]":
'\\[[^]]*\\]'.
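
A small sketch of the difference, using gregexpr() to collect all
matches in the example string (the variable names are made up for
illustration):

```r
x <- "...(abc)...(def)..."

# Greedy ".*" runs from the first "(" to the last ")"
greedy <- regmatches(x, gregexpr("\\(.*\\)", x))[[1]]
print(greedy)

# A negated character class stops at the first closing parenthesis
negated <- regmatches(x, gregexpr("\\([^)]*\\)", x))[[1]]
print(negated)

# The same idea for square brackets, applied to one of the headers
unit <- regmatches("angle 1 [degree]",
                   regexpr("\\[[^]]*\\]", "angle 1 [degree]"))
print(unit)
```

The greedy pattern should give one long match, while the negated class
gives the two groups separately.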

-- 
Best regards,
Ivan


