[R] element wise pattern recognition and string substitution

Bert Gunter bgunter.4567 at gmail.com
Mon Sep 5 21:11:06 CEST 2016


Jun:

You need to provide a clear specification via regular expressions of
the patterns you wish to match -- at least for me to decipher it.
Others may be smarter than I, though...
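
For what it's worth, the element-wise case from the original question
(one pattern per string) can be sketched with mapply(), which pairs
patterns[i] with strings[i]. This is only a sketch using the patterns
and strings from the original post, and it assumes the two vectors have
the same length. mapply() still calls sub() once per pair, so it is a
tidier loop rather than a guaranteed speedup.

```r
# Patterns and strings copied from the original question
pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
patterns <- c(pattern1, pattern2)
strings  <- c("TX.WT.CUT.mean", "mg.tx.cv")

# Apply patterns[i] to strings[i] and keep the second capture group
res <- mapply(function(p, s) sub(p, "\\2", s),
              patterns, strings, USE.NAMES = FALSE)
res
# [1] "WT.CUT" "tx"
```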

Jeff: Thanks. I have now convinced myself that it can be done (a
"proof" of sorts): If pat1, pat2, ..., patm are m different patterns
(in a vector of patterns) to be matched in a vector of n strings,
where only one of the patterns will match in any string, then use
paste() (probably via do.call()) or otherwise to paste them together
separated by "|" to form the concatenated pattern, pat. Then

sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)

should extract the matching substring from each string (perhaps with a
little fiddling due to precedence rules and the greediness of the
leading ".*"; the example below uses "^[^b]*" in its place); e.g.

> z <- c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")

> pat1 <- "a+\\.*a+"
> pat2 <- "b+\\.*b+"
> pat <- c(pat1, pat2)

> pat <- do.call(paste, c(as.list(pat), sep = "|"))
> pat
[1] "a+\\.*a+|b+\\.*b+"

> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
[1] "a.a"   "bb"    "b.bbb"
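
A slightly more compact variant, offered as a sketch rather than the
definitive way to do it: paste(..., collapse = "|") builds the same
combined pattern as the do.call() call above, and regmatches() with
regexpr() pulls out the first match in each string without the "^...$"
wrapper. The names pats and first_match are illustrative only. Note
that regmatches() drops strings with no match, so this assumes every
string matches one of the patterns.

```r
z <- c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
pats <- c("a+\\.*a+", "b+\\.*b+")

# Join the alternatives with "|" to form a single pattern
pat <- paste(pats, collapse = "|")

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text
first_match <- regmatches(z, regexpr(pat, z))
first_match
# [1] "a.a"   "bb"    "b.bbb"
```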

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> Thanks for the reply, Bert.
>
> Your solution solves the example. I actually have a more general situation
> where I have this dot-concatenated string built from multiple variables. The
> problem is that those variables may have values containing dots, and the
> number of dots is not consistent across the values of a variable. So I am
> thinking of defining a vector of patterns for the vector of strings and,
> hopefully, finding a way to apply each pattern in the pattern vector to the
> corresponding value of the string vector. The only way I can think of is a
> "for" loop, which can be slow. Also, this is happening in a function I am
> writing. I just wonder whether there is another, more efficient way. Thanks
> a lot.
>
> Jun
>
> On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>
>> Well, he did provide an example, and...
>>
>>
>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
>>
>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
>> [1] "WT.CUT" "tx"
>>
>>
>> ## seems to do what was requested.
>>
>> Jeff would have to amplify on his initial statement however: do you
>> mean that separate patterns can always be combined via "|" ?  Or
>> something deeper?
>>
>> Cheers,
>> Bert
>>
>>
>> On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
>> wrote:
>> > Your opening assertion is false.
>> >
>> > Provide a reproducible example and someone will demonstrate.
>> > --
>> > Sent from my phone. Please excuse my brevity.
>> >
>> > On September 4, 2016 9:06:59 PM PDT, Jun Shen <jun.shen.ut at gmail.com>
>> > wrote:
>> >>Dear list,
>> >>
>> >>I have a vector of strings that cannot be described by one pattern. So
>> >>let's say I construct a vector of patterns of the same length as the
>> >>vector of strings; can I do element-wise pattern recognition and string
>> >>substitution?
>> >>
>> >>For example,
>> >>
>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>> >>
>> >>patterns <- c(pattern1,pattern2)
>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>> >>
>> >>Say I want to extract "WT.CUT" from the first string and "tx" from the
>> >>second string. If I do
>> >>
>> >>sub(patterns, '\\2', strings), only the first pattern will be used.
>> >>
>> >>Looping over the patterns doesn't work the way I want. I would
>> >>appreciate any comments.
>> >>Thanks.
>> >>
>> >>Jun
>> >>
>> >>______________________________________________
>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>> >>PLEASE do read the posting guide
>> >>http://www.R-project.org/posting-guide.html
>> >>and provide commented, minimal, self-contained, reproducible code.


