[R] element wise pattern recognition and string substitution

Ista Zahn istazahn at gmail.com
Wed Sep 7 15:34:44 CEST 2016


On Tue, Sep 6, 2016 at 11:59 PM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> Hi Ista,
>
> Thanks for the suggestion. I didn't know mapply can be used this way! Let me
> take one more step. Instead of defining a pattern for each string, I would
> like to define a set of patterns from all the possible combination of the
> unique values of those variables. Then I need each string to find a pattern
> for itself.

Uh, hmm, what?!? I have no idea what this means. Example?

--Ista

> I know this is getting a little stretching. Thanks for all the
> suggestion/comments from everyone.
>
> Jun
>
> On Tue, Sep 6, 2016 at 9:44 PM, Ista Zahn <istazahn at gmail.com> wrote:
>>
>> If you want to match each element of 'strings' to a different regex, do
>> it. Here are three ways, using your original example.
>>
>> pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>> pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>>
>> patterns <- c(pattern1,pattern2)
>> strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>>
>> for(i in seq(strings)) print(sub(patterns[i], "\\2", strings[i]))
>>
>> mapply(sub, pattern = patterns, x = strings,
>>        MoreArgs = list(replacement = "\\2"))
>>
>> library(stringi)
>> stri_replace_all_regex(strings, patterns, "$2")
>>
>> Best,
>> Ista
>> On Tue, Sep 6, 2016 at 9:20 PM, Jun Shen <jun.shen.ut at gmail.com> wrote:
>> > Hi Jeff,
>> >
>> > Thanks for the reply. I tried your suggestion and it doesn't seem to
>> > work, and I tried a simple pattern as follows and it works as expected
>> >
>> > sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\1',
>> > "3.mg.kg.>50-70.kg.P05")
>> > [1] "3.mg.kg"
>> >
>> > sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\2',
>> > "3.mg.kg.>50-70.kg.P05")
>> > [1] ">50-70.kg"
>> >
>> > sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\3',
>> > "3.mg.kg.>50-70.kg.P05")
>> > [1] "P05"
>> >
>> > My problem is the pattern has to be dynamically constructed on the input
>> > data of the function I am writing. It's actually not too difficult to
>> > assemble the final.pattern with some code like the following
>> >
>> > sort.var <- c('TX','WTCUT')
>> > combn.sort.var <- do.call(expand.grid, lapply(sort.var, function(x)
>> >   paste('(', gsub('\\.', '\\\\.', unlist(unique(all.exposure[x]))), ')',
>> >         sep='')))
>> > all.patterns <- do.call(paste, c(combn.sort.var, '(.*)', sep='\\.'))
>> > final.pattern <- paste0(all.patterns, collapse='|')
>> >
>> > You cannot run the code directly since the data object "all.exposure" is
>> > not provided here.
>> >
>> > Jun
>> >
>> >
>> >
>> > On Tue, Sep 6, 2016 at 8:18 PM, Jeff Newmiller
>> > <jdnewmil at dcn.davis.ca.us>
>> > wrote:
>> >
>> >> I am not near my computer today, but each parenthesis gets its own
>> >> result number, so you should put the parenthesis around the whole
>> >> pattern of alternatives instead of having many parentheses.
>> >>
>> >> I recommend thinking in terms of what common information you expect to
>> >> find in these various strings, and place your parentheses to capture
>> >> that information. There is no other reason to put parentheses in the
>> >> pattern... they are not grouping symbols.
>> >> --
>> >> Sent from my phone. Please excuse my brevity.
>> >>
>> >> On September 6, 2016 5:01:04 PM PDT, Bert Gunter
>> >> <bgunter.4567 at gmail.com>
>> >> wrote:
>> >> >Jun:
>> >> >
>> >> >1. Tell us your desired result from your test vector and maybe someone
>> >> >will help.
>> >> >
>> >> >2. As we played this game once already (you couldn't do it; I showed
>> >> >you how), this seems to be a function of your limitations with regular
>> >> >expressions. I'm probably not much better, but in any case, I don't
>> >> >intend to be your consultant. See if you can find someone locally to
>> >> >help you if you do not receive a satisfactory reply from the list.
>> >> >There are many people here who are pretty good at this sort of thing,
>> >> >but I don't know if they'll reply. Regex's are certainly complex. PERL
>> >> >people tend to be pretty good at them, I believe. There are numerous
>> >> >web sites and books on them if you need to acquire expertise for your
>> >> >work.
>> >> >
>> >> >Cheers,
>> >> >Bert
>> >> >Bert Gunter
>> >> >
>> >> >"The trouble with having an open mind is that people keep coming along
>> >> >and sticking things into it."
>> >> >-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>> >> >
>> >> >
>> >> >On Tue, Sep 6, 2016 at 3:59 PM, Jun Shen <jun.shen.ut at gmail.com>
>> >> > wrote:
>> >> >> Hi Bert,
>> >> >>
>> >> >> I still couldn't make the multiple patterns to work. Here is an
>> >> >> example. I make the pattern as follows
>> >> >>
>> >> >> final.pattern <- paste0(
>> >> >>   "(240\\.m\\.g)\\.(>50-70\\.kg)\\.(.*)|",
>> >> >>   "(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)|",
>> >> >>   "(240\\.m\\.g)\\.(>70-90\\.kg)\\.(.*)|",
>> >> >>   "(3\\.mg\\.kg)\\.(>70-90\\.kg)\\.(.*)|",
>> >> >>   "(240\\.m\\.g)\\.(>90-110\\.kg)\\.(.*)|",
>> >> >>   "(3\\.mg\\.kg)\\.(>90-110\\.kg)\\.(.*)|",
>> >> >>   "(240\\.m\\.g)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
>> >> >>   "(3\\.mg\\.kg)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
>> >> >>   "(240\\.m\\.g)\\.(>110\\.kg)\\.(.*)|",
>> >> >>   "(3\\.mg\\.kg)\\.(>110\\.kg)\\.(.*)")
>> >> >>
>> >> >> test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
>> >> >> '240.m.g.>50-70.kg.geo.mean')
>> >> >>
>> >> >> sub(final.pattern, '\\1', test.string)
>> >> >> sub(final.pattern, '\\2', test.string)
>> >> >> sub(final.pattern, '\\3', test.string)
>> >> >>
>> >> >> Only the third string has been correctly parsed, which matches the
>> >> >> first pattern. It seems the rest of the patterns are not called.
>> >> >>
>> >> >> Jun
>> >> >>
>> >> >>
>> >> >> On Mon, Sep 5, 2016 at 10:21 PM, Bert Gunter
>> >> >> <bgunter.4567 at gmail.com> wrote:
>> >> >>>
>> >> >>> Just noticed: My clumsy do.call() line in my previously posted code
>> >> >>> below should be replaced with:
>> >> >>> pat <- paste(pat,collapse = "|")
>> >> >>>
>> >> >>>
>> >> >>> > pat <- c(pat1,pat2)
>> >> >>> > paste(pat,collapse="|")
>> >> >>> [1] "a+\\.*a+|b+\\.*b+"
>> >> >>>
>> >> >>> ************ replace this **************************
>> >> >>> > pat <- do.call(paste,c(as.list(pat), sep="|"))
>> >> >>> ********************************************
>> >> >>> > sub(paste0("^[^b]*(",pat,").*$"),"\\1",z)
>> >> >>> [1] "a.a"   "bb"    "b.bbb"
>> >> >>>
>> >> >>>
>> >> >>> -- Bert
>> >> >>> Bert Gunter
>> >> >>>
>> >> >>> On Mon, Sep 5, 2016 at 12:11 PM, Bert Gunter
>> >> >>> <bgunter.4567 at gmail.com> wrote:
>> >> >>> > Jun:
>> >> >>> >
>> >> >>> > You need to provide a clear specification via regular expressions of
>> >> >>> > the patterns you wish to match -- at least for me to decipher it.
>> >> >>> > Others may be smarter than I, though...
>> >> >>> >
>> >> >>> > Jeff: Thanks. I have now convinced myself that it can be done (a
>> >> >>> > "proof" of sorts): If pat1, pat2, ..., patm are m different patterns
>> >> >>> > (in a vector of patterns) to be matched in a vector of n strings,
>> >> >>> > where only one of the patterns will match in any string, then use
>> >> >>> > paste() (probably via do.call()) or otherwise to paste them together
>> >> >>> > separated by "|" to form the concatenated pattern, pat. Then
>> >> >>> >
>> >> >>> > sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)
>> >> >>> >
>> >> >>> > should extract the matching pattern in each (perhaps with a little
>> >> >>> > fiddling due to precedence rules); e.g.
>> >> >>> >
>> >> >>> >> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
>> >> >>> >
>> >> >>> >> pat1 <- "a+\\.*a+"
>> >> >>> >> pat2 <-"b+\\.*b+"
>> >> >>> >> pat <- c(pat1,pat2)
>> >> >>> >
>> >> >>> >> pat <- do.call(paste,c(as.list(pat), sep="|"))
>> >> >>> >> pat
>> >> >>> > [1] "a+\\.*a+|b+\\.*b+"
>> >> >>> >
>> >> >>> >> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
>> >> >>> > [1] "a.a"   "bb"    "b.bbb"
>> >> >>> >
>> >> >>> > Cheers,
>> >> >>> > Bert
>> >> >>> >
>> >> >>> >
>> >> >>> > Bert Gunter
>> >> >>> >
>> >> >>> > On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen <jun.shen.ut at gmail.com>
>> >> >>> > wrote:
>> >> >>> >> Thanks for the reply, Bert.
>> >> >>> >>
>> >> >>> >> Your solution solves the example. I actually have a more general
>> >> >>> >> situation where I have this dot concatenated string from multiple
>> >> >>> >> variables. The problem is those variables may have values with dots
>> >> >>> >> in there. The number of dots are not consistent for all values of a
>> >> >>> >> variable. So I am thinking to define a vector of patterns for the
>> >> >>> >> vector of the string and hopefully to find a way to use a pattern
>> >> >>> >> from the pattern vector for each value of the string vector. The only
>> >> >>> >> way I can think of is "for" loop, which can be slow. Also these are
>> >> >>> >> happening in a function I am writing. Just wonder if there is
>> >> >>> >> another more efficient way. Thanks a lot.
>> >> >>> >>
>> >> >>> >> Jun
>> >> >>> >>
>> >> >>> >> On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter
>> >> >>> >> <bgunter.4567 at gmail.com> wrote:
>> >> >>> >>>
>> >> >>> >>> Well, he did provide an example, and...
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
>> >> >>> >>>
>> >> >>> >>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
>> >> >>> >>> [1] "WT.CUT" "tx"
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> ## seems to do what was requested.
>> >> >>> >>>
>> >> >>> >>> Jeff would have to amplify on his initial statement however: do you
>> >> >>> >>> mean that separate patterns can always be combined via "|" ? Or
>> >> >>> >>> something deeper?
>> >> >>> >>>
>> >> >>> >>> Cheers,
>> >> >>> >>> Bert
>> >> >>> >>> Bert Gunter
>> >> >>> >>>
>> >> >>> >>> On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller
>> >> >>> >>> <jdnewmil at dcn.davis.ca.us>
>> >> >>> >>> wrote:
>> >> >>> >>> > Your opening assertion is false.
>> >> >>> >>> >
>> >> >>> >>> > Provide a reproducible example and someone will demonstrate.
>> >> >>> >>> > --
>> >> >>> >>> > Sent from my phone. Please excuse my brevity.
>> >> >>> >>> >
>> >> >>> >>> > On September 4, 2016 9:06:59 PM PDT, Jun Shen
>> >> >>> >>> > <jun.shen.ut at gmail.com>
>> >> >>> >>> > wrote:
>> >> >>> >>> >>Dear list,
>> >> >>> >>> >>
>> >> >>> >>> >>I have a vector of strings that cannot be described by one pattern.
>> >> >>> >>> >>So let's say I construct a vector of patterns in the same length as
>> >> >>> >>> >>the vector of strings, can I do the element wise pattern recognition
>> >> >>> >>> >>and string substitution?
>> >> >>> >>> >>
>> >> >>> >>> >>For example,
>> >> >>> >>> >>
>> >> >>> >>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>> >> >>> >>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>> >> >>> >>> >>
>> >> >>> >>> >>patterns <- c(pattern1,pattern2)
>> >> >>> >>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>> >> >>> >>> >>
>> >> >>> >>> >>Say I want to extract "WT.CUT" from the first string and "tx" from
>> >> >>> >>> >>the second string. If I do
>> >> >>> >>> >>
>> >> >>> >>> >>sub(patterns, '\\2', strings), only the first pattern will be used.
>> >> >>> >>> >>
>> >> >>> >>> >>looping the patterns doesn't work the way I want. Appreciate any
>> >> >>> >>> >>comments.
>> >> >>> >>> >>Thanks.
>> >> >>> >>> >>
>> >> >>> >>> >>Jun
>> >> >>> >>> >>
>> >> >>> >>> >>       [[alternative HTML version deleted]]
>> >> >>> >>> >>
>> >> >>> >>> >>______________________________________________
>> >> >>> >>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
>> >> >see
>> >> >>> >>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>> >>> >>PLEASE do read the posting guide
>> >> >>> >>> >>http://www.R-project.org/posting-guide.html
>> >> >>> >>> >>and provide commented, minimal, self-contained, reproducible
>> >> >code.
>> >> >>> >>> >
>> >> >>> >>
>> >> >>> >>
>> >> >>
>> >> >>
>> >>
>> >>
>> >
>
>
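
One possible reading of the request at the top of this message (build a set
of patterns from all combinations of the variables' values, then let each
string find the pattern that fits it) can be sketched as follows. This is
only a sketch under assumptions: the values of tx and wtcut are copied from
the final.pattern example quoted in this thread, and the helpers esc() and
extract_group() are hypothetical names, not functions from any package.
Joining the alternatives with "|" does not work with "\\1".."\\3", because
each parenthesis gets its own number across the whole pattern, so those
references only ever point into the first alternative. Keeping the patterns
separate and trying them one at a time avoids that.

```r
# Assumed example values, taken from the final.pattern quoted in the thread.
tx    <- c("240.m.g", "3.mg.kg")
wtcut <- c(">50-70.kg", ">110.kg")

# Hypothetical helper: escape literal dots so a value matches itself.
# Only "." is handled; other regex metacharacters would need the same care.
esc <- function(x) gsub("\\.", "\\\\.", x)

# One anchored pattern per combination of values, each with three groups.
grid <- expand.grid(tx = esc(tx), wtcut = esc(wtcut),
                    stringsAsFactors = FALSE)
patterns <- paste0("^(", grid$tx, ")\\.(", grid$wtcut, ")\\.(.*)$")

# Hypothetical helper: for each string, use the first pattern that matches
# it and return the requested group; strings with no match stay NA.
# The loop runs over the patterns, so every step is vectorised over strings.
extract_group <- function(strings, patterns, group) {
  out <- rep(NA_character_, length(strings))
  for (p in patterns) {
    todo <- is.na(out) & grepl(p, strings)
    out[todo] <- sub(p, paste0("\\", group), strings[todo])
  }
  out
}

test.string <- c("240.m.g.>110.kg.geo.mean", "3.mg.kg.>110.kg.P05",
                 "240.m.g.>50-70.kg.geo.mean")

extract_group(test.string, patterns, 1)
extract_group(test.string, patterns, 2)
extract_group(test.string, patterns, 3)
```

Because the loop is over the (usually few) patterns rather than over the
strings, this should stay reasonably fast for long string vectors. If two
patterns could match the same string, the one that comes first in patterns
wins, so the order of the combinations matters in that case.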


