[R] regex

Tue Sep 17 16:46:08 CEST 2019

Thanks Jeff!
It does indeed make sense that there is no "AND" corresponding to the "|".
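
For the archive: an "AND" can still be emulated, either by combining two grepl() calls or with a lookahead-based pattern, which in R needs perl = TRUE. The following is only a minimal sketch; the file names are made up for illustration and are not my real files.

```r
# Hypothetical file names, for illustration only
files <- c("a_w_1.csv", "a_w_2.txt", "b_x_1.csv", "notes.txt")

# Option 1: two separate tests combined with a logical AND
both1 <- files[grepl("_w_", files) & grepl("\\.csv$", files)]

# Option 2: a single pattern with two lookaheads anchored at the start;
# lookaheads are a PCRE feature, hence perl = TRUE
both2 <- files[grepl("^(?=.*_w_)(?=.*\\.csv$)", files, perl = TRUE)]
```

Option 1 is probably easier to read; Option 2 is only worth it where a single pattern string is required, as with the pattern argument of list.files(), which offers no perl switch and so cannot use lookaheads anyway.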

Ivan

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

On 17/09/2019 16:38, Jeff Newmiller wrote:
> https://stackoverflow.com/questions/3041320/regex-and-operator/37692545
>
> On September 17, 2019 6:39:13 AM PDT, Ivan Calandra <calandra using rgzm.de> wrote:
>> Thank you Ivan for your help!
>>
>> Your solution for the first problem is so simple I didn't even think
>> about it!
>> What I find weird is that "_w_|\\.csv$" works as expected ("OR"), but is
>> there no way to combine two patterns with an "AND"?
>>
>> Your solution to the second problem is unfortunately even more
>> complicated to me than the gsub() solution. But I'm glad I can learn
>> about regmatches() and regexpr()!
>>
>> Best,
>> Ivan
>>
>> On 17/09/2019 09:14, Ivan Krylov wrote:
>>> On Tue, 17 Sep 2019 08:48:43 +0200
>>> Ivan Calandra <calandra using rgzm.de> wrote:
>>>
>>>> CSVs <- list.files(path=..., pattern="\\.csv$")
>>>> w.files <- CSVs[grep(pattern="_w_", CSVs)]
>>>>
>>>> Of course, what I would like to do is list only the interesting files
>>>> from the beginning, rather than subsetting the whole list of files.
>>> One way to express that would be "_w_.*\\.csv$", meaning that the
>>> filename has to have "_w_" in it, followed by anything (any character
>>> repeated any number of times, including 0), followed by ".csv" at the
>>> end of the line.
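
A small check of that pattern, as a sketch only: the temporary directory and the file names below are assumptions made up for the example, not the real data.

```r
# Hypothetical file names, created in a temporary directory for illustration
tmp <- tempfile("csvdemo")
dir.create(tmp)
demo.files <- c("a_w_1.csv", "a_w_2.txt", "b_x_1.csv", "a_w_.csv")
invisible(file.create(file.path(tmp, demo.files)))

# "_w_" somewhere in the name, then anything (possibly nothing),
# then ".csv" at the very end
w.files <- list.files(path = tmp, pattern = "_w_.*\\.csv$")
```

Because ".csv" is anchored at the end, the fixed order in the pattern ("_w_" first, extension last) costs nothing here; for two conditions that can occur in either order, this trick would not be enough.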
>>>
>>>> 2) The units of the variables are given in the original headers. I
>>>> would like to extract the units. This is what I did:
>>>>
>>>> headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
>>>>              "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
>>>> units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)
>>>>
>>>> It seems to me overly complicated to use gsub(). Isn't there a way
>>>> to extract what is interesting rather than deleting what is not?
>>> Pure-R way: use regmatches() + regexpr(). Both regmatches and regexpr
>>> take the character vector as an argument, so duplication is hard to
>>> avoid:
>>>
>>> units <- regmatches(headers, regexpr('\\[.*\\]', headers))
>>>
>>> The stringr package has an str_match() function with a nicer interface:
>>> str_match(headers, '\\[.*\\]') -> units
>>>
>>> Such "greedy" patterns containing ".*" present a few pitfalls, e.g.
>>> looking for text in parentheses using the pattern "\\(.*\\)" in
>>> "...(abc)...(def)..." will match the whole "(abc)...(def)" instead of
>>> single groups "(abc)" and "(def)", but with your examples the pattern
>>> should work as presented. One other option would be to ask for "[",
>>> followed by zero or more characters that are not "]", followed by "]":
>>> '\\[[^]]*\\]'.
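
The difference between the greedy pattern and the negated character class can be tried out as follows. This is a sketch in base R only; the final gsub() step that strips the brackets is an addition for illustration and was not part of the suggestion above.

```r
x <- "...(abc)...(def)..."

# Greedy: ".*" runs on to the last ")" in the string
greedy <- regmatches(x, regexpr("\\(.*\\)", x))

# Negated class: each match stops at the first ")" after its "(";
# gregexpr() returns all matches rather than only the first
groups <- regmatches(x, gregexpr("\\([^)]*\\)", x))[[1]]

# Applied to the headers from this thread
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
with.brackets <- regmatches(headers, regexpr("\\[[^]]*\\]", headers))

# Assumed extra step: drop the surrounding brackets to keep only the unit
units.var <- gsub("^\\[|\\]$", "", with.brackets)
```

Note that regmatches() combined with regexpr() silently drops elements without a match, so the result can be shorter than the input if some header has no unit.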
>>>