[R] element wise pattern recognition and string substitution

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Wed Sep 7 09:04:09 CEST 2016


Here are some suggestions:

test.string <- c( '240.m.g.>110.kg.geo.mean'
                 , '3.mg.kg.>110.kg.P05'
                 , '240.m.g.>50-70.kg.geo.mean'
                 )
# based on your literal idea
suggested.pattern1 <-
   "(240\\.m\\.g|3\\.mg\\.kg)\\.(>50-70\\.kg|>70-90\\.kg|>90-110\\.kg|50\\.kg\\.or\\.less|>110\\.kg)\\.(.*)"

resultL <- strsplit( sub( suggested.pattern1
                         , "\\1\t\\2\t\\3"
                         , test.string )
                    , split = "\t"
                    )
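
The reason for using a single set of three groups can be seen in a small sketch. The names many.groups, one.group and x are made up for this illustration and are not from the thread. Groups are numbered left to right across the whole pattern, so when every alternative carries its own parentheses, "\\1" only ever refers to the first alternative. Note also that sub() only accepts backreferences "\\1" through "\\9".

```r
# Each alternative has its own parentheses: the dose is group 1 or group 3,
# depending on which alternative matched.
many.groups <- "(240\\.m\\.g)\\.(.*)|(3\\.mg\\.kg)\\.(.*)"

# One pair of parentheses around the alternatives: the dose is always group 1.
one.group <- "(240\\.m\\.g|3\\.mg\\.kg)\\.(.*)"

x <- c( "240.m.g.>110.kg.P05", "3.mg.kg.>110.kg.P05" )

# Group 1 did not take part in the match for the second string,
# so it is replaced by an empty string there.
dose.many <- sub( many.groups, "\\1", x )

# With the alternatives inside one group, both strings yield their dose.
dose.one <- sub( one.group, "\\1", x )
```

This is only meant to show the numbering behaviour; the patterns are cut down from the ones in this thread.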

# equivalent based on apparent repetitive patterns in your sample data
suggested.pattern2 <- "(.*?m\\.g|kg)\\.(.*?kg|.*?less)\\.(.*)"

resultL2 <- strsplit( sub( suggested.pattern2
                          , "\\1\t\\2\t\\3"
                          , test.string
                          )
                     , split = "\t"
                     )

# put results into an organized table
DF <- setNames( data.frame( do.call( rbind, resultL ) )
               , c( "First", "Second", "Third" )
               )
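
If the pattern has to be built from data, one possible approach is to build one group per variable rather than one alternative per combination of values. The following is a sketch under assumptions: the data frame all.exposure is a made-up stand-in (its column names TX and WTCUT come from the thread, its values from the test strings), and make.group is a hypothetical helper name.

```r
# Hypothetical stand-in for the data that was not posted.
all.exposure <- data.frame( TX = c( "240.m.g", "3.mg.kg", "240.m.g" )
                          , WTCUT = c( ">110.kg", ">110.kg", ">50-70.kg" )
                          , stringsAsFactors = FALSE
                          )
sort.var <- c( "TX", "WTCUT" )

# Escape the literal dots, then join the distinct values of one variable
# with "|" inside a single pair of parentheses.
make.group <- function( v ) {
  alternatives <- gsub( "\\.", "\\\\.", unique( v ) )
  paste0( "(", paste( alternatives, collapse = "|" ), ")" )
}

# One group per variable, then a catch-all group for the remainder.
groups <- vapply( all.exposure[ sort.var ], make.group, character( 1 ) )
final.pattern <- paste( c( groups, "(.*)" ), collapse = "\\." )

test.string <- c( "240.m.g.>110.kg.geo.mean"
                , "3.mg.kg.>110.kg.P05"
                , "240.m.g.>50-70.kg.geo.mean"
                )
parts <- strsplit( sub( final.pattern, "\\1\t\\2\t\\3", test.string )
                 , split = "\t"
                 )
```

The number of groups then equals the number of variables plus one, regardless of how many distinct values each variable has. Only dots are escaped here; values containing other regex metacharacters would need more care.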

By the way... please aim to make your examples reproducible. It would have 
been easy for you to define the necessary variables in example form
rather than sending a non-reproducible example.
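
On the original question of using one pattern per string: sub() is not vectorized over its pattern argument, but mapply() can pair each pattern with its string. A minimal sketch, reusing the patterns and strings from the first message in this thread (the name middle is made up here):

```r
pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
patterns <- c( pattern1, pattern2 )
strings <- c( "TX.WT.CUT.mean", "mg.tx.cv" )

# Call sub( patterns[i], "\\2", strings[i] ) for each i; the replacement
# is a length-one vector and is recycled.
middle <- mapply( sub, patterns, "\\2", strings, USE.NAMES = FALSE )
```

This still loops internally, so it is a convenience rather than a speed-up; a single combined pattern is likely to be faster when one can be written.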

On Tue, 6 Sep 2016, Jun Shen wrote:

> Hi Jeff,
> 
> Thanks for the reply. I tried your suggestion and it doesn't seem to work. I then tried a simple pattern, as follows, and it works as expected:
> 
> sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\1', "3.mg.kg.>50-70.kg.P05")
> [1] "3.mg.kg"
> 
> sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\2', "3.mg.kg.>50-70.kg.P05")
> [1] ">50-70.kg"
> 
> sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\3', "3.mg.kg.>50-70.kg.P05")
> [1] "P05"
> 
> My problem is that the pattern has to be constructed dynamically from the input data of the function I am writing. It's actually not too
> difficult to assemble the final.pattern with some code like the following
> 
> sort.var <- c('TX','WTCUT')
> combn.sort.var <- do.call(expand.grid, lapply(sort.var, function(x)
>     paste('(', gsub('\\.', '\\\\.', unlist(unique(all.exposure[x]))), ')', sep='')))
> all.patterns <- do.call(paste, c(combn.sort.var, '(.*)', sep='\\.'))
> final.pattern <- paste0(all.patterns, collapse='|')
> 
> You cannot run the code directly since the data object "all.exposure" is not provided here.
> 
> Jun
> 
> 
> 
> On Tue, Sep 6, 2016 at 8:18 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>       I am not near my computer today, but each pair of parentheses gets its own result number, so you should put one pair of
>       parentheses around the whole set of alternatives instead of having many pairs.
>
>       I recommend thinking in terms of what common information you expect to find in these various strings, and placing your
>       parentheses to capture that information. Parentheses also group alternatives, but every pair captures, so add no more
>       than you need.
>       --
>       Sent from my phone. Please excuse my brevity.
>
>       On September 6, 2016 5:01:04 PM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>       >Jun:
>       >
>       >1. Tell us your desired result from your test vector and maybe someone
>       >will help.
>       >
>       >2. As we played this game once already (you couldn't do it; I showed
>       >you how), this seems to be a function of your limitations with regular
>       >expressions. I'm probably not much better, but in any case, I don't
>       >intend to be your consultant. See if you can find someone locally to
>       >help you if you do not receive a satisfactory reply from the list.
>       >There are many people here who are pretty good at this sort of thing,
>       >but I don't know if they'll reply. Regexes are certainly complex. Perl
>       >people tend to be pretty good at them, I believe. There are numerous
>       >web sites and books on them if you need to acquire expertise for your
>       >work.
>       >
>       >Cheers,
>       >Bert
>       >Bert Gunter
>       >
>       >"The trouble with having an open mind is that people keep coming along
>       >and sticking things into it."
>       >-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>       >
>       >
>       >On Tue, Sep 6, 2016 at 3:59 PM, Jun Shen <jun.shen.ut at gmail.com> wrote:
>       >> Hi Bert,
>       >>
>       >> I still couldn't make the multiple patterns work. Here is an example. I
>       >> make the pattern as follows
>       >>
>       >> final.pattern <- paste0(
>       >>   "(240\\.m\\.g)\\.(>50-70\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)|",
>       >>   "(240\\.m\\.g)\\.(>70-90\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>70-90\\.kg)\\.(.*)|",
>       >>   "(240\\.m\\.g)\\.(>90-110\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>90-110\\.kg)\\.(.*)|",
>       >>   "(240\\.m\\.g)\\.(50\\.kg\\.or\\.less)\\.(.*)|(3\\.mg\\.kg)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
>       >>   "(240\\.m\\.g)\\.(>110\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>110\\.kg)\\.(.*)")
>       >>
>       >> test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
>       >> '240.m.g.>50-70.kg.geo.mean')
>       >>
>       >> sub(final.pattern, '\\1', test.string)
>       >> sub(final.pattern, '\\2', test.string)
>       >> sub(final.pattern, '\\3', test.string)
>       >>
>       >> Only the third string has been correctly parsed, which matches the first
>       >> pattern. It seems the rest of the patterns are not called.
>       >>
>       >> Jun
>       >>
>       >>
>       >> On Mon, Sep 5, 2016 at 10:21 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>       >>>
>       >>> Just noticed: My clumsy do.call() line in my previously posted code
>       >>> below should be replaced with:
>       >>> pat <- paste(pat,collapse = "|")
>       >>>
>       >>>
>       >>> > pat <- c(pat1,pat2)
>       >>> > paste(pat,collapse="|")
>       >>> [1] "a+\\.*a+|b+\\.*b+"
>       >>>
>       >>> ************ replace this **************************
>       >>> > pat <- do.call(paste,c(as.list(pat), sep="|"))
>       >>> ********************************************
>       >>> > sub(paste0("^[^b]*(",pat,").*$"),"\\1",z)
>       >>> [1] "a.a"   "bb"    "b.bbb"
>       >>>
>       >>>
>       >>> -- Bert
>       >>>
>       >>>
>       >>> On Mon, Sep 5, 2016 at 12:11 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>       >>> > Jun:
>       >>> >
>       >>> > You need to provide a clear specification via regular expressions of
>       >>> > the patterns you wish to match -- at least for me to decipher it.
>       >>> > Others may be smarter than I, though...
>       >>> >
>       >>> > Jeff: Thanks. I have now convinced myself that it can be done (a
>       >>> > "proof" of sorts): If pat1, pat2, ..., patm are m different patterns
>       >>> > (in a vector of patterns) to be matched in a vector of n strings,
>       >>> > where only one of the patterns will match in any string, then use
>       >>> > paste() (probably via do.call()) or otherwise to paste them together
>       >>> > separated by "|" to form the concatenated pattern, pat. Then
>       >>> >
>       >>> > sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)
>       >>> >
>       >>> > should extract the matching pattern in each (perhaps with a little
>       >>> > fiddling due to precedence rules); e.g.
>       >>> >
>       >>> >> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
>       >>> >
>       >>> >> pat1 <- "a+\\.*a+"
>       >>> >> pat2 <-"b+\\.*b+"
>       >>> >> pat <- c(pat1,pat2)
>       >>> >
>       >>> >> pat <- do.call(paste,c(as.list(pat), sep="|"))
>       >>> >> pat
>       >>> > [1] "a+\\.*a+|b+\\.*b+"
>       >>> >
>       >>> >> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
>       >>> > [1] "a.a"   "bb"    "b.bbb"
>       >>> >
>       >>> > Cheers,
>       >>> > Bert
>       >>> >
>       >>> >
>       >>> >
>       >>> >
>       >>> > On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
>       >>> >> Thanks for the reply, Bert.
>       >>> >>
>       >>> >> Your solution solves the example. I actually have a more general
>       >>> >> situation where I have this dot-concatenated string built from multiple
>       >>> >> variables. The problem is that those variables may have values that
>       >>> >> contain dots, and the number of dots is not consistent across the values
>       >>> >> of a variable. So I am thinking of defining a vector of patterns for the
>       >>> >> vector of strings and hopefully finding a way to use one pattern from the
>       >>> >> pattern vector for each value of the string vector. The only way I can
>       >>> >> think of is a "for" loop, which can be slow. Also, this is happening in a
>       >>> >> function I am writing. I just wonder whether there is a more efficient
>       >>> >> way. Thanks a lot.
>       >>> >>
>       >>> >> Jun
>       >>> >>
>       >>> >> On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>       >>> >>>
>       >>> >>> Well, he did provide an example, and...
>       >>> >>>
>       >>> >>>
>       >>> >>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
>       >>> >>>
>       >>> >>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
>       >>> >>> [1] "WT.CUT" "tx"
>       >>> >>>
>       >>> >>>
>       >>> >>> ## seems to do what was requested.
>       >>> >>>
>       >>> >>> Jeff would have to amplify on his initial statement, however: do you
>       >>> >>> mean that separate patterns can always be combined via "|"? Or
>       >>> >>> something deeper?
>       >>> >>>
>       >>> >>> Cheers,
>       >>> >>> Bert
>       >>> >>>
>       >>> >>>
>       >>> >>> On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>       >>> >>> > Your opening assertion is false.
>       >>> >>> >
>       >>> >>> > Provide a reproducible example and someone will demonstrate.
>       >>> >>> > --
>       >>> >>> > Sent from my phone. Please excuse my brevity.
>       >>> >>> >
>       >>> >>> > On September 4, 2016 9:06:59 PM PDT, Jun Shen <jun.shen.ut at gmail.com> wrote:
>       >>> >>> >>Dear list,
>       >>> >>> >>
>       >>> >>> >>I have a vector of strings that cannot be described by one pattern.
>       >>> >>> >>So let's say I construct a vector of patterns of the same length as the
>       >>> >>> >>vector of strings. Can I do element-wise pattern recognition and string
>       >>> >>> >>substitution?
>       >>> >>> >>
>       >>> >>> >>For example,
>       >>> >>> >>
>       >>> >>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>       >>> >>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>       >>> >>> >>
>       >>> >>> >>patterns <- c(pattern1,pattern2)
>       >>> >>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>       >>> >>> >>
>       >>> >>> >>Say I want to extract "WT.CUT" from the first string and "tx" from the
>       >>> >>> >>second string. If I do
>       >>> >>> >>
>       >>> >>> >>sub(patterns, '\\2', strings), only the first pattern will be used.
>       >>> >>> >>
>       >>> >>> >>Looping over the patterns doesn't work the way I want. I would
>       >>> >>> >>appreciate any comments.
>       >>> >>> >>Thanks.
>       >>> >>> >>
>       >>> >>> >>Jun
>       >>> >>> >>
>       >>> >>> >>       [[alternative HTML version deleted]]
>       >>> >>> >>
>       >>> >>> >>______________________________________________
>       >>> >>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>       >>> >>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>       >>> >>> >>PLEASE do read the posting guide
>       >>> >>> >>http://www.R-project.org/posting-guide.html
>       >>> >>> >>and provide commented, minimal, self-contained, reproducible code.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------

