[R] element wise pattern recognition and string substitution

Jun Shen jun.shen.ut at gmail.com
Wed Sep 7 02:57:16 CEST 2016


Hi Bert,

final.pattern contains ten alternative patterns, joined with "|".

>sub(final.pattern, '\\1', test.string)
Expected results: "240.m.g" "3.mg.kg" "240.m.g"
Current results: "" "" "240.m.g"

>sub(final.pattern, '\\2', test.string)
Expected results: ">110.kg" ">110.kg" ">50-70.kg"
Current results: "" "" ">50-70.kg"

>sub(final.pattern, '\\3', test.string)
Expected results: "geo.mean" "P05" "geo.mean"
Current results: "" "" "geo.mean"

Right now, I only get the results from the third string.
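
One likely explanation for the empty strings: capture groups are numbered left to right across the whole pattern, so the ten alternatives in final.pattern define groups 1 to 30, and '\\1' to '\\3' are only filled when the first alternative is the one that matches. The replacement string in sub() only supports backreferences '\\1' to '\\9', so the groups of the later alternatives cannot be reached at all. The sketch below shows the effect on a shortened two-alternative pattern, then one possible way around it: moving the alternation inside three groups. The names two.alt, dose, weight and grouped.pattern are illustrative assumptions, not from the thread.

```r
test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
                 '240.m.g.>50-70.kg.geo.mean')

# Shortened version of final.pattern with only two alternatives.
# Groups 1-3 belong to the first alternative, groups 4-6 to the second.
two.alt <- paste0("(240\\.m\\.g)\\.(>50-70\\.kg)\\.(.*)|",
                  "(240\\.m\\.g)\\.(>110\\.kg)\\.(.*)")

# The first test string matches the second alternative, so group 1 is empty
# and the dose ends up in group 4 instead.
first.group  <- sub(two.alt, "\\1", test.string[1])  # ""
fourth.group <- sub(two.alt, "\\4", test.string[1])  # "240.m.g"

# Possible fix: a single pattern with exactly three groups, with the
# alternatives listed inside each group (assumed dose and weight values
# are the ones that appear in final.pattern).
dose   <- c("240\\.m\\.g", "3\\.mg\\.kg")
weight <- c(">50-70\\.kg", ">70-90\\.kg", ">90-110\\.kg",
            "50\\.kg\\.or\\.less", ">110\\.kg")
grouped.pattern <- paste0("^(", paste(dose, collapse = "|"), ")\\.(",
                          paste(weight, collapse = "|"), ")\\.(.*)$")

part1 <- sub(grouped.pattern, "\\1", test.string)
part2 <- sub(grouped.pattern, "\\2", test.string)
part3 <- sub(grouped.pattern, "\\3", test.string)
# part1: "240.m.g"  "3.mg.kg" "240.m.g"
# part2: ">110.kg"  ">110.kg" ">50-70.kg"
# part3: "geo.mean" "P05"     "geo.mean"
```

With this layout the group numbers no longer depend on which dose/weight combination matched, so the same three backreferences work for every string.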


On Tue, Sep 6, 2016 at 8:01 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:

> Jun:
>
> 1. Tell us your desired result from your test vector and maybe someone
> will help.
>
> 2. As we played this game once already (you couldn't do it; I showed
> you how), this seems to be a function of your limitations with regular
> expressions. I'm probably not much better, but in any case, I don't
> intend to be your consultant. See if you can find someone locally to
> help you if you do not receive a satisfactory reply from the list.
> There are many people here who are pretty good at this sort of thing,
> but I don't know if they'll reply. Regexes are certainly complex. Perl
> people tend to be pretty good at them, I believe. There are numerous
> web sites and books on them if you need to acquire expertise for your
> work.
>
> Cheers,
> Bert
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Tue, Sep 6, 2016 at 3:59 PM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> > Hi Bert,
> >
> > I still couldn't get the multiple patterns to work. Here is an example. I
> > built the pattern as follows:
> >
> > final.pattern <- paste0(
> >   "(240\\.m\\.g)\\.(>50-70\\.kg)\\.(.*)|",
> >   "(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)|",
> >   "(240\\.m\\.g)\\.(>70-90\\.kg)\\.(.*)|",
> >   "(3\\.mg\\.kg)\\.(>70-90\\.kg)\\.(.*)|",
> >   "(240\\.m\\.g)\\.(>90-110\\.kg)\\.(.*)|",
> >   "(3\\.mg\\.kg)\\.(>90-110\\.kg)\\.(.*)|",
> >   "(240\\.m\\.g)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
> >   "(3\\.mg\\.kg)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
> >   "(240\\.m\\.g)\\.(>110\\.kg)\\.(.*)|",
> >   "(3\\.mg\\.kg)\\.(>110\\.kg)\\.(.*)")
> >
> > test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
> > '240.m.g.>50-70.kg.geo.mean')
> >
> > sub(final.pattern, '\\1', test.string)
> > sub(final.pattern, '\\2', test.string)
> > sub(final.pattern, '\\3', test.string)
> >
> > Only the third string has been correctly parsed, which matches the first
> > pattern. It seems the rest of the patterns are not called.
> >
> > Jun
> >
> >
> > On Mon, Sep 5, 2016 at 10:21 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> >>
> >> Just noticed: My clumsy do.call() line in my previously posted code
> >> below should be replaced with:
> >> pat <- paste(pat,collapse = "|")
> >>
> >>
> >> > pat <- c(pat1,pat2)
> >> > paste(pat,collapse="|")
> >> [1] "a+\\.*a+|b+\\.*b+"
> >>
> >> ************ replace this **************************
> >> > pat <- do.call(paste,c(as.list(pat), sep="|"))
> >> ********************************************
> >> > sub(paste0("^[^b]*(",pat,").*$"),"\\1",z)
> >> [1] "a.a"   "bb"    "b.bbb"
> >>
> >>
> >> -- Bert
> >> Bert Gunter
> >>
> >>
> >>
> >> On Mon, Sep 5, 2016 at 12:11 PM, Bert Gunter <bgunter.4567 at gmail.com>
> >> wrote:
> >> > Jun:
> >> >
> >> > You need to provide a clear specification via regular expressions of
> >> > the patterns you wish to match -- at least for me to decipher it.
> >> > Others may be smarter than I, though...
> >> >
> >> > Jeff: Thanks. I have now convinced myself that it can be done (a
> >> > "proof" of sorts): If pat1, pat2,..., patm are m different patterns
> >> > (in a vector of patterns)  to be matched in a vector of n strings,
> >> > where only one of the patterns will match in any string,  then use
> >> > paste() (probably via do.call()) or otherwise to paste them together
> >> > separated by "|" to form the concatenated pattern, pat. Then
> >> >
> >> > sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)
> >> >
> >> > should extract the matching pattern in each (perhaps with a little
> >> > fiddling due to precedence rules); e.g.
> >> >
> >> >> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
> >> >
> >> >> pat1 <- "a+\\.*a+"
> >> >> pat2 <-"b+\\.*b+"
> >> >> pat <- c(pat1,pat2)
> >> >
> >> >> pat <- do.call(paste,c(as.list(pat), sep="|"))
> >> >> pat
> >> > [1] "a+\\.*a+|b+\\.*b+"
> >> >
> >> >> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
> >> > [1] "a.a"   "bb"    "b.bbb"
> >> >
> >> > Cheers,
> >> > Bert
> >> >
> >> >
> >> > Bert Gunter
> >> >
> >> >
> >> >
> >> > On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> >> >> Thanks for the reply, Bert.
> >> >>
> >> >> Your solution solves the example. I actually have a more general
> >> >> situation where I have this dot-concatenated string built from multiple
> >> >> variables. The problem is that those variables may have values that
> >> >> contain dots, and the number of dots is not consistent across the values
> >> >> of a variable. So I am thinking of defining a vector of patterns for the
> >> >> vector of strings and hopefully finding a way to use one pattern from
> >> >> the pattern vector for each value of the string vector. The only way I
> >> >> can think of is a "for" loop, which can be slow. Also, this is happening
> >> >> inside a function I am writing. I just wonder if there is another, more
> >> >> efficient way. Thanks a lot.
> >> >>
> >> >> Jun
> >> >>
> >> >> On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter <bgunter.4567 at gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Well, he did provide an example, and...
> >> >>>
> >> >>>
> >> >>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
> >> >>>
> >> >>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
> >> >>> [1] "WT.CUT" "tx"
> >> >>>
> >> >>>
> >> >>> ## seems to do what was requested.
> >> >>>
> >> >>> Jeff would have to amplify on his initial statement however: do you
> >> >>> mean that separate patterns can always be combined via "|" ?  Or
> >> >>> something deeper?
> >> >>>
> >> >>> Cheers,
> >> >>> Bert
> >> >>> Bert Gunter
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller
> >> >>> <jdnewmil at dcn.davis.ca.us>
> >> >>> wrote:
> >> >>> > Your opening assertion is false.
> >> >>> >
> >> >>> > Provide a reproducible example and someone will demonstrate.
> >> >>> > --
> >> >>> > Sent from my phone. Please excuse my brevity.
> >> >>> >
> >> >>> > On September 4, 2016 9:06:59 PM PDT, Jun Shen
> >> >>> > <jun.shen.ut at gmail.com>
> >> >>> > wrote:
> >> >>> >>Dear list,
> >> >>> >>
> >> >>> >>I have a vector of strings that cannot be described by one pattern.
> >> >>> >>So let's say I construct a vector of patterns of the same length as
> >> >>> >>the vector of strings. Can I do element-wise pattern recognition
> >> >>> >>and string substitution?
> >> >>> >>
> >> >>> >>For example,
> >> >>> >>
> >> >>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
> >> >>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
> >> >>> >>
> >> >>> >>patterns <- c(pattern1,pattern2)
> >> >>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
> >> >>> >>
> >> >>> >>Say I want to extract "WT.CUT" from the first string and "tx" from
> >> >>> >>the second string. If I do
> >> >>> >>
> >> >>> >>sub(patterns, '\\2', strings), only the first pattern will be used.
> >> >>> >>
> >> >>> >>Looping over the patterns doesn't work the way I want. I'd appreciate
> >> >>> >>any comments.
> >> >>> >>Thanks.
> >> >>> >>
> >> >>> >>Jun
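
For the element-wise case asked about above, here is a minimal sketch, assuming there is exactly one pattern per string and that the two vectors are in matching order. sub() is not vectorized over its pattern argument, but mapply() can pair patterns[i] with strings[i]. It is still a loop under the hood, so it is not necessarily faster than an explicit for loop. The name extracted is illustrative and not from the thread.

```r
pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"

patterns <- c(pattern1, pattern2)
strings  <- c('TX.WT.CUT.mean', 'mg.tx.cv')

# Apply the i-th pattern to the i-th string and keep the second group.
extracted <- mapply(function(p, s) sub(p, "\\2", s),
                    patterns, strings, USE.NAMES = FALSE)
# extracted: "WT.CUT" "tx"
```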
> >> >>> >>
> >> >>> >>
> >> >>> >>______________________________________________
> >> >>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> >>> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>> >>PLEASE do read the posting guide
> >> >>> >>http://www.R-project.org/posting-guide.html
> >> >>> >>and provide commented, minimal, self-contained, reproducible code.
> >> >>> >
> >> >>
> >> >>
> >
> >
>
