[R] element wise pattern recognition and string substitution

Bert Gunter bgunter.4567 at gmail.com
Tue Sep 6 04:21:03 CEST 2016


Just noticed: My clumsy do.call() line in my previously posted code
below should be replaced with:
pat <- paste(pat,collapse = "|")


> pat <- c(pat1,pat2)
> pat <- paste(pat,collapse="|")   ## replaces: do.call(paste,c(as.list(pat), sep="|"))
> pat
[1] "a+\\.*a+|b+\\.*b+"
> sub(paste0("^[^b]*(",pat,").*$"),"\\1",z)
[1] "a.a"   "bb"    "b.bbb"
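For anyone who wants to run the corrected version on its own, here is a minimal, self-contained sketch. The data and the two patterns are those from the thread. The names `pats`, `via_sub` and `via_regmatches` are only illustrative, and the `regmatches()` variant is an assumed alternative that is not part of the original posting.

```r
# Example data and patterns taken from the thread.
z    <- c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
pats <- c("a+\\.*a+", "b+\\.*b+")

# Join the alternatives into a single pattern. collapse= works on the
# vector directly, so no do.call() is needed.
pat <- paste(pats, collapse = "|")

# Approach from the thread: capture the matching alternative and replace
# the whole string with the captured group.
via_sub <- sub(paste0("^[^b]*(", pat, ").*$"), "\\1", z)

# Assumed alternative (not from the thread): pull out the first match
# directly, which avoids the leading "^[^b]*" workaround.
via_regmatches <- regmatches(z, regexpr(pat, z))

print(via_sub)          # "a.a" "bb" "b.bbb", as reported in the thread
print(via_regmatches)
```

Note that `regmatches()` with `regexpr()` drops elements that have no match, so its result can be shorter than the input, whereas `sub()` returns non-matching strings unchanged.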


-- Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Sep 5, 2016 at 12:11 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> Jun:
>
> You need to provide a clear specification via regular expressions of
> the patterns you wish to match -- at least for me to decipher it.
> Others may be smarter than I, though...
>
> Jeff: Thanks. I have now convinced myself that it can be done (a
> "proof" of sorts): If pat1, pat2,..., patm are m different patterns
> (in a vector of patterns) to be matched in a vector of n strings,
> where only one of the patterns will match in any string, then use
> paste() (probably via do.call()) or otherwise to paste them together
> separated by "|" to form the concatenated pattern, pat. Then
>
> sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)
>
> should extract the matching pattern in each (perhaps with a little
> fiddling due to precedence rules); e.g.
>
>> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
>
>> pat1 <- "a+\\.*a+"
>> pat2 <-"b+\\.*b+"
>> pat <- c(pat1,pat2)
>
>> pat <- do.call(paste,c(as.list(pat), sep="|"))
>> pat
> [1] "a+\\.*a+|b+\\.*b+"
>
>> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
> [1] "a.a"   "bb"    "b.bbb"
>
> Cheers,
> Bert
>
> On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
>> Thanks for the reply, Bert.
>>
>> Your solution solves the example. I actually have a more general situation
>> where I have this dot-concatenated string built from multiple variables. The
>> problem is that those variables may have values with dots in them. The number
>> of dots is not consistent across all values of a variable. So I am thinking of
>> defining a vector of patterns for the vector of strings and hopefully finding
>> a way to use a pattern from the pattern vector for each value of the string
>> vector. The only way I can think of is a "for" loop, which can be slow. Also,
>> this is happening in a function I am writing. I just wonder if there is
>> another, more efficient way. Thanks a lot.
>>
>> Jun
>>
>> On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>
>>> Well, he did provide an example, and...
>>>
>>>
>>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
>>>
>>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
>>> [1] "WT.CUT" "tx"
>>>
>>>
>>> ## seems to do what was requested.
>>>
>>> Jeff would have to amplify on his initial statement however: do you
>>> mean that separate patterns can always be combined via "|" ?  Or
>>> something deeper?
>>>
>>> Cheers,
>>> Bert
>>>
>>> On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
>>> wrote:
>>> > Your opening assertion is false.
>>> >
>>> > Provide a reproducible example and someone will demonstrate.
>>> > --
>>> > Sent from my phone. Please excuse my brevity.
>>> >
>>> > On September 4, 2016 9:06:59 PM PDT, Jun Shen <jun.shen.ut at gmail.com>
>>> > wrote:
>>> >>Dear list,
>>> >>
>>> >>I have a vector of strings that cannot be described by one pattern. So
>>> >>let's say I construct a vector of patterns of the same length as the
>>> >>vector of strings. Can I then do element-wise pattern recognition and
>>> >>string substitution?
>>> >>
>>> >>For example,
>>> >>
>>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>>> >>
>>> >>patterns <- c(pattern1,pattern2)
>>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>>> >>
>>> >>Say I want to extract "WT.CUT" from the first string and "tx" from the
>>> >>second string. If I do
>>> >>
>>> >>sub(patterns, '\\2', strings), only the first pattern will be used.
>>> >>
>>> >>Looping over the patterns doesn't work the way I want. I would appreciate
>>> >>any comments.
>>> >>Thanks.
>>> >>
>>> >>Jun
>>> >>
>>> >>______________________________________________
>>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>>> >>PLEASE do read the posting guide
>>> >>http://www.R-project.org/posting-guide.html
>>> >>and provide commented, minimal, self-contained, reproducible code.
>>
>>
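The original question asked for element-wise matching: pattern i applied to string i. Since `sub()` uses only the first element of a pattern vector, one way to pair them up is `mapply()`. The following is a minimal sketch using the patterns and strings from the original post. The helper name `extract_second` is hypothetical, and this is one possible approach under those assumptions, not the one settled on in the thread.

```r
# Patterns and strings from the original post.
pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
patterns <- c(pattern1, pattern2)
strings  <- c("TX.WT.CUT.mean", "mg.tx.cv")

# Hypothetical helper: apply one pattern to one string and keep the
# second capture group.
extract_second <- function(p, s) sub(p, "\\2", s)

# mapply() walks both vectors in parallel, so patterns[i] is used
# for strings[i].
out <- mapply(extract_second, patterns, strings, USE.NAMES = FALSE)

print(out)   # "WT.CUT" "tx", the result requested in the original post
```

`mapply()` still loops internally, so this is mainly a tidier way to write the "for" loop rather than a large speed-up. When the patterns can be merged into a single regular expression, one vectorized `sub()` call is likely to be faster.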


