[R] regex

Jeff Newmiller
Tue Sep 17 16:38:46 CEST 2019


https://stackoverflow.com/questions/3041320/regex-and-operator/37692545
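
For what it's worth, here is a minimal sketch of the "AND" idea discussed at that link. The file names are made up for illustration. The lookahead form needs perl = TRUE, which grepl() accepts; as far as I know list.files() has no such argument, so the sketch filters a character vector instead. When the order of the two pieces is known, a single order-dependent pattern such as "_w_.*\\.csv$" does the same job and can be passed to list.files() directly.

```r
# Hypothetical file names, for illustration only
files <- c("a_w_1.csv", "a_x_1.csv", "b_w_2.txt", "c_w_3.csv")

# "AND" by combining two logical tests
both1 <- files[grepl("_w_", files) & grepl("\\.csv$", files)]

# "AND" in a single pattern via a lookahead (PCRE syntax, so perl = TRUE)
both2 <- files[grepl("^(?=.*_w_).*\\.csv$", files, perl = TRUE)]

# Order-dependent single pattern: "_w_" somewhere before a final ".csv"
both3 <- files[grepl("_w_.*\\.csv$", files)]
```

All three approaches should select the same two files here; the first is probably the easiest to read and to extend to more conditions.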

On September 17, 2019 6:39:13 AM PDT, Ivan Calandra <calandra using rgzm.de> wrote:
>Thank you Ivan for your help!
>
>Your solution for the first problem is so simple I didn't even think 
>about it!
>What I find weird is that "_w_|\\.csv$" works as expected ("OR"), but is
>there no way to combine two patterns with an "AND"?
>
>Your solution to the second problem is actually unfortunately even more
>complicated to me than the gsub() solution. But I'm glad I can learn
>about regmatches() and regexpr()!
>
>Best,
>Ivan
>
>--
>Dr. Ivan Calandra
>TraCEr, laboratory for Traceology and Controlled Experiments
>MONREPOS Archaeological Research Centre and
>Museum for Human Behavioural Evolution
>Schloss Monrepos
>56567 Neuwied, Germany
>+49 (0) 2631 9772-243
>https://www.researchgate.net/profile/Ivan_Calandra
>
>On 17/09/2019 09:14, Ivan Krylov wrote:
>> On Tue, 17 Sep 2019 08:48:43 +0200
>> Ivan Calandra <calandra using rgzm.de> wrote:
>>
>>> CSVs <- list.files(path=..., pattern="\\.csv$")
>>> w.files <- CSVs[grep(pattern="_w_", CSVs)]
>>>
>>> Of course, what I would like to do is list only the interesting files
>>> from the beginning, rather than subsetting the whole list of files.
>> One way to express that would be "_w_.*\\.csv$", meaning that the
>> filename has to have "_w_" in it, followed by anything (any character
>> repeated any number of times, including 0), followed by ".csv" at the
>> end of the line.
>>
>>> 2) The units of the variables are given in the original headers. I
>>> would like to extract the units. This is what I did:
>>> headers <- c("dist to origin on curve [mm]","segment on section [mm]",
>>> "angle 1 [degree]", "angle 2 [degree]","angle 3 [degree]")
>>> units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)
>>>
>>> It seems to me overly complicated using gsub(). Isn't there a way
>>> to extract what is interesting rather than deleting what is not?
>> Pure-R way: use regmatches() + regexpr(). Both regmatches and regexpr
>> take the character vector as an argument, so duplication is hard to
>> avoid:
>>
>> units <- regmatches(headers, regexpr('\\[.*\\]', headers))
>>
>> The stringr package has an str_match() function with a nicer interface:
>> str_match(headers, '\\[.*\\]') -> units.
>>
>> Such "greedy" patterns containing ".*" present a few pitfalls, e.g.
>> looking for text in parentheses using the pattern "\\(.*\\)" in
>> "...(abc)...(def)..." will match the whole "(abc)...(def)" instead of
>> single groups "(abc)" and "(def)", but with your examples the pattern
>> should work as presented. One other option would be to ask for "[",
>> followed by zero or more characters that are not "]", followed by "]":
>> '\\[[^]]*\\]'.
>>
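
A small sketch of the two points above, using the headers from the original question. The substr() step that strips the brackets is my own addition, not part of the quoted suggestion, and the string x is just the example text from the paragraph above.

```r
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")

# Match "[", then any run of characters that are not "]", then "]"
m <- regmatches(headers, regexpr("\\[[^]]*\\]", headers))

# Drop the enclosing brackets to keep only the unit (added step)
units.var <- substr(m, 2, nchar(m) - 1)

# Greedy ".*" versus a negated character class on a string with two groups
x <- "...(abc)...(def)..."
greedy <- regmatches(x, regexpr("\\(.*\\)", x))           # one long match
groups <- regmatches(x, gregexpr("\\([^)]*\\)", x))[[1]]  # each group separately
```

One caveat: regmatches() combined with regexpr() returns only the elements that matched, so if some header had no unit the result would be shorter than headers.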
>

-- 
Sent from my phone. Please excuse my brevity.


