[R] element wise pattern recognition and string substitution

Jun Shen jun.shen.ut at gmail.com
Wed Sep 7 03:20:41 CEST 2016


Hi Jeff,

Thanks for the reply. I tried your suggestion, but it doesn't seem to work.
A simple pattern like the following, however, works as expected:

sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\1', "3.mg.kg.>50-70.kg.P05")
[1] "3.mg.kg"

sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\2', "3.mg.kg.>50-70.kg.P05")
[1] ">50-70.kg"

sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\3', "3.mg.kg.>50-70.kg.P05")
[1] "P05"
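The three calls above can also be collapsed into one. As a minimal sketch, base R's regexec() together with regmatches() returns the full match and every capture group in a single step. The variable names x, pat and parts are only illustrative.

```r
x   <- "3.mg.kg.>50-70.kg.P05"
pat <- "(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into substrings.
parts <- regmatches(x, regexec(pat, x))[[1]]

parts[1]   # the full match
parts[-1]  # the three capture groups, in order
```

This avoids running the same pattern three times when all three pieces are needed.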

My problem is that the pattern has to be constructed dynamically from the
input data of the function I am writing. It is actually not too difficult to
assemble final.pattern with code like the following:

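sort.var <- c('TX', 'WTCUT')
# One parenthesized, dot-escaped group per unique value of each variable
combn.sort.var <- do.call(expand.grid, lapply(sort.var, function(x)
  paste('(', gsub('\\.', '\\\\.', unlist(unique(all.exposure[x]))), ')',
        sep = '')))
# Join each combination of groups with a literal dot, ending in (.*)
all.patterns <- do.call(paste, c(combn.sort.var, '(.*)', sep = '\\.'))
# Combine all combinations into a single alternation
final.pattern <- paste0(all.patterns, collapse = '|')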

You cannot run the code directly since the data object "all.exposure" is
not provided here.
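One possible reason the combined pattern misbehaves is that capture groups are numbered from left to right across the whole pattern, so in an alternation only the first alternative owns \\1 to \\3. The sketch below first shows that numbering on a toy pattern and then builds one group per variable, with the alternatives inside each group. The data frame is a hypothetical stand-in for all.exposure, and escape.dots and group.for are illustrative helper names, not part of the original code.

```r
# Groups are numbered across the whole pattern: when the second alternative
# matches, its captured text is in group 2, not group 1.
demo <- sub("(a)x|(b)y", "\\2", "by")

# Hypothetical stand-in for all.exposure, which is not shown in this thread.
all.exposure <- data.frame(
  TX    = c("240.m.g", "3.mg.kg", "240.m.g"),
  WTCUT = c(">110.kg", ">110.kg", ">50-70.kg"),
  stringsAsFactors = FALSE
)
sort.var <- c("TX", "WTCUT")

# Escape literal dots so a value can be embedded in a regular expression
# (same gsub() call as in the code above).
escape.dots <- function(x) gsub("\\.", "\\\\.", x)

# Build one capture group per variable: "(value1|value2|...)".
group.for <- function(v) {
  values <- escape.dots(unique(all.exposure[[v]]))
  paste0("(", paste(values, collapse = "|"), ")")
}

# Groups are joined by a literal dot; the last group takes the remainder.
final.pattern <- paste0(
  "^",
  paste(vapply(sort.var, group.for, character(1)), collapse = "\\."),
  "\\.(.*)$"
)

test.string <- c("240.m.g.>110.kg.geo.mean", "3.mg.kg.>110.kg.P05",
                 "240.m.g.>50-70.kg.geo.mean")

tx    <- sub(final.pattern, "\\1", test.string)
wtcut <- sub(final.pattern, "\\2", test.string)
stat  <- sub(final.pattern, "\\3", test.string)
```

With this layout the group numbers stay fixed at one per variable no matter how many values each variable takes, so \\1, \\2 and \\3 mean the same thing for every string.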

Jun



On Tue, Sep 6, 2016 at 8:18 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
wrote:

> I am not near my computer today, but each pair of parentheses gets its own
> result number, so you should put the parentheses around the whole pattern of
> alternatives instead of having many parentheses.
>
> I recommend thinking in terms of what common information you expect to
> find in these various strings, and place your parentheses to capture that
> information. There is no other reason to put parentheses in the pattern...
> they are not grouping symbols.
> --
> Sent from my phone. Please excuse my brevity.
>
> On September 6, 2016 5:01:04 PM PDT, Bert Gunter <bgunter.4567 at gmail.com>
> wrote:
> >Jun:
> >
> >1. Tell us your desired result from your test vector and maybe someone
> >will help.
> >
> >2. As we played this game once already (you couldn't do it; I showed
> >you how), this seems to be a function of your limitations with regular
> >expressions. I'm probably not much better, but in any case, I don't
> >intend to be your consultant. See if you can find someone locally to
> >help you if you do not receive a satisfactory reply from the list.
> >There are many people here who are pretty good at this sort of thing,
> >but I don't know if they'll reply. Regex's are certainly complex. PERL
> >people tend to be pretty good at them, I believe. There are numerous
> >web sites and books on them if you need to acquire expertise for your
> >work.
> >
> >Cheers,
> >Bert
> >Bert Gunter
> >
> >"The trouble with having an open mind is that people keep coming along
> >and sticking things into it."
> >-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> >
> >On Tue, Sep 6, 2016 at 3:59 PM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> >> Hi Bert,
> >>
> >> I still couldn't get the multiple patterns to work. Here is an example.
> >> I build the pattern as follows:
> >>
> >> final.pattern <- paste0(
> >>   "(240\\.m\\.g)\\.(>50-70\\.kg)\\.(.*)|",
> >>   "(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)|",
> >>   "(240\\.m\\.g)\\.(>70-90\\.kg)\\.(.*)|",
> >>   "(3\\.mg\\.kg)\\.(>70-90\\.kg)\\.(.*)|",
> >>   "(240\\.m\\.g)\\.(>90-110\\.kg)\\.(.*)|",
> >>   "(3\\.mg\\.kg)\\.(>90-110\\.kg)\\.(.*)|",
> >>   "(240\\.m\\.g)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
> >>   "(3\\.mg\\.kg)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
> >>   "(240\\.m\\.g)\\.(>110\\.kg)\\.(.*)|",
> >>   "(3\\.mg\\.kg)\\.(>110\\.kg)\\.(.*)"
> >> )
> >>
> >> test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
> >> '240.m.g.>50-70.kg.geo.mean')
> >>
> >> sub(final.pattern, '\\1', test.string)
> >> sub(final.pattern, '\\2', test.string)
> >> sub(final.pattern, '\\3', test.string)
> >>
> >> Only the third string, which matches the first pattern, has been parsed
> >> correctly. It seems the rest of the patterns are not used.
> >>
> >> Jun
> >>
> >>
> >> On Mon, Sep 5, 2016 at 10:21 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> >>>
> >>> Just noticed: My clumsy do.call() line in my previously posted code
> >>> below should be replaced with:
> >>> pat <- paste(pat,collapse = "|")
> >>>
> >>>
> >>> > pat <- c(pat1,pat2)
> >>> > paste(pat,collapse="|")
> >>> [1] "a+\\.*a+|b+\\.*b+"
> >>>
> >>> ************ replace this **************************
> >>> > pat <- do.call(paste,c(as.list(pat), sep="|"))
> >>> ********************************************
> >>> > sub(paste0("^[^b]*(",pat,").*$"),"\\1",z)
> >>> [1] "a.a"   "bb"    "b.bbb"
> >>>
> >>>
> >>> -- Bert
> >>> Bert Gunter
> >>>
> >>>
> >>>
> >>> On Mon, Sep 5, 2016 at 12:11 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> >>> > Jun:
> >>> >
> >>> > You need to provide a clear specification via regular expressions of
> >>> > the patterns you wish to match -- at least for me to decipher it.
> >>> > Others may be smarter than I, though...
> >>> >
> >>> > Jeff: Thanks. I have now convinced myself that it can be done (a
> >>> > "proof" of sorts): If pat1, pat2, ..., patm are m different patterns
> >>> > (in a vector of patterns) to be matched in a vector of n strings,
> >>> > where only one of the patterns will match in any string, then use
> >>> > paste() (probably via do.call()) or otherwise to paste them together
> >>> > separated by "|" to form the concatenated pattern, pat. Then
> >>> >
> >>> > sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)
> >>> >
> >>> > should extract the matching pattern in each (perhaps with a little
> >>> > fiddling due to precedence rules); e.g.
> >>> >
> >>> >> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
> >>> >
> >>> >> pat1 <- "a+\\.*a+"
> >>> >> pat2 <-"b+\\.*b+"
> >>> >> pat <- c(pat1,pat2)
> >>> >
> >>> >> pat <- do.call(paste,c(as.list(pat), sep="|"))
> >>> >> pat
> >>> > [1] "a+\\.*a+|b+\\.*b+"
> >>> >
> >>> >> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
> >>> > [1] "a.a"   "bb"    "b.bbb"
> >>> >
> >>> > Cheers,
> >>> > Bert
> >>> >
> >>> >
> >>> > Bert Gunter
> >>> >
> >>> >
> >>> >
> >>> > On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> >>> >> Thanks for the reply, Bert.
> >>> >>
> >>> >> Your solution solves the example. I actually have a more general
> >>> >> situation where I have this dot-concatenated string built from
> >>> >> multiple variables. The problem is that those variables may have
> >>> >> values that contain dots, and the number of dots is not consistent
> >>> >> across the values of a variable. So I am thinking of defining a
> >>> >> vector of patterns for the vector of strings, and hopefully finding
> >>> >> a way to use one pattern from the pattern vector for each value of
> >>> >> the string vector. The only way I can think of is a "for" loop,
> >>> >> which can be slow. Also, this is happening in a function I am
> >>> >> writing. I just wonder if there is another, more efficient way.
> >>> >> Thanks a lot.
> >>> >>
> >>> >> Jun
> >>> >>
> >>> >> On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> >>> >>>
> >>> >>> Well, he did provide an example, and...
> >>> >>>
> >>> >>>
> >>> >>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
> >>> >>>
> >>> >>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
> >>> >>> [1] "WT.CUT" "tx"
> >>> >>>
> >>> >>>
> >>> >>> ## seems to do what was requested.
> >>> >>>
> >>> >>> Jeff would have to amplify on his initial statement, however: do you
> >>> >>> mean that separate patterns can always be combined via "|"? Or
> >>> >>> something deeper?
> >>> >>>
> >>> >>> Cheers,
> >>> >>> Bert
> >>> >>> Bert Gunter
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> >>> >>> > Your opening assertion is false.
> >>> >>> >
> >>> >>> > Provide a reproducible example and someone will demonstrate.
> >>> >>> > --
> >>> >>> > Sent from my phone. Please excuse my brevity.
> >>> >>> >
> >>> >>> > On September 4, 2016 9:06:59 PM PDT, Jun Shen <jun.shen.ut at gmail.com> wrote:
> >>> >>> >>Dear list,
> >>> >>> >>
> >>> >>> >>I have a vector of strings that cannot be described by one pattern.
> >>> >>> >>So let's say I construct a vector of patterns of the same length as
> >>> >>> >>the vector of strings. Can I then do element-wise pattern
> >>> >>> >>recognition and string substitution?
> >>> >>> >>
> >>> >>> >>For example,
> >>> >>> >>
> >>> >>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
> >>> >>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
> >>> >>> >>
> >>> >>> >>patterns <- c(pattern1,pattern2)
> >>> >>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
> >>> >>> >>
> >>> >>> >>Say I want to extract "WT.CUT" from the first string and "tx" from
> >>> >>> >>the second string. If I do
> >>> >>> >>
> >>> >>> >>sub(patterns, '\\2', strings), only the first pattern will be used.
> >>> >>> >>
> >>> >>> >>Looping over the patterns doesn't work the way I want. I would
> >>> >>> >>appreciate any comments.
> >>> >>> >>Thanks.
> >>> >>> >>
> >>> >>> >>Jun
> >>> >>> >>
> >>> >>> >>
> >>> >>> >>______________________________________________
> >>> >>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> >>> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >>> >>> >>PLEASE do read the posting guide
> >>> >>> >>http://www.R-project.org/posting-guide.html
> >>> >>> >>and provide commented, minimal, self-contained, reproducible code.
> >>> >>> >
> >>> >>
> >>> >>
> >>
> >>
>
>
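
Returning to the question that opened the thread: sub() is not vectorized over its pattern argument, which is why only the first pattern is used. A minimal sketch of an element-wise alternative pairs each pattern with its string through mapply(). This is one possible approach, not the one the thread settled on, and the name extracted is illustrative.

```r
pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"

patterns <- c(pattern1, pattern2)
strings  <- c("TX.WT.CUT.mean", "mg.tx.cv")

# Apply the i-th pattern to the i-th string and keep the second group.
extracted <- mapply(function(p, s) sub(p, "\\2", s),
                    patterns, strings, USE.NAMES = FALSE)
extracted
```

mapply() still loops internally, so it is a convenience rather than a speed-up over an explicit "for" loop.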



