[R] Retain last grouping after a strsplit()

David Winsemius dwinsemius at comcast.net
Tue Dec 11 21:24:19 CET 2012


On Dec 11, 2012, at 11:14 AM, Steven Ranney wrote:

> David and Jim -
> 
> Thanks for your help.  Your suggestions worked just fine.  Now my task
> is to learn why the random-looking string of characters in the first
> part of Jim's sub() statement isn't really so random.
> 

Jim's solution can be read as:

Pattern matching phase:

Starting at the beginning of the string ("^"), consume as few characters as possible (".*?") until what remains is a run of one or more characters in the range "0" to "9" ("[0-9]+") that extends to the end ("$"). The parentheses store that run of digits as matched group number 1, which the replacement refers to as "\\1". Because the pattern is anchored at both ends, it matches the whole string.

Substitution phase:

Replace what is matched (the whole string in this case) with just the first numbered matched group, "\\1".
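
A small sketch of the two phases, using the same data as in the thread. The object names (tagged, lazy, greedy) are only illustrative. Wrapping "\\1" in angle brackets shows that the whole string is replaced, and dropping the "?" shows why the lazy quantifier matters for a multi-digit suffix:

```r
x <- c("OYS-PIA2-FL-1", "OYS-PIA2-LA-1", "OYS-PI-LA-BB-1", "OYS-PIA2-LA-10")

# Nothing of the original string survives except the captured group,
# so the brackets end up around the digits alone.
tagged <- sub("^.*?([0-9]+)$", "<\\1>", x)

# Lazy ".*?" stops as early as it can, leaving every trailing digit
# for the group.
lazy <- sub("^.*?([0-9]+)$", "\\1", x)

# Greedy ".*" takes as much as it can and leaves the group only the
# single digit that "[0-9]+" requires.
greedy <- sub("^.*([0-9]+)$", "\\1", x)

print(tagged)
print(lazy)
print(greedy)
```

The last element is the one to watch: the lazy pattern returns "10", while the greedy one returns only "0".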


Notice that this could be thought of as a "positive replacement" in contrast to my solution and Gabor Grothendieck's later and slightly more compact version which could be called "negative replacements".
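
To put the two styles side by side, here is a rough sketch. The helper last_number() is hypothetical (it is not from anyone's post) and assumes you would rather get NA than an unchanged or empty string when there are no trailing digits:

```r
x <- c("OYS-PIA2-FL-1", "OYS-PIA2-LA-10", "OYS-PIA2-LA-")

# "Positive" replacement: keep only the captured trailing digits.
keep_digits <- sub("^.*?([0-9]+)$", "\\1", x)

# "Negative" replacement: delete everything up to and including the last dash.
drop_prefix <- sub("^.+-", "", x)

# Hypothetical helper: test for trailing digits first, return NA otherwise.
last_number <- function(s) {
  hit <- grepl("[0-9]+$", s)
  out <- rep(NA_character_, length(s))
  out[hit] <- sub("^.*?([0-9]+)$", "\\1", s[hit])
  out
}

print(keep_digits)
print(drop_prefix)
print(last_number(x))
```

The two replacements agree on the first two elements and differ only on the third, where the positive version returns the input unchanged and the negative version returns an empty string.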
-- 
David

> Thanks again -
> 
> SR
> Steven H. Ranney
> 
> 
> On Tue, Dec 11, 2012 at 11:37 AM, David Winsemius
> <dwinsemius at comcast.net> wrote:
>> 
>> On Dec 11, 2012, at 10:10 AM, jim holtman wrote:
>> 
>>> try this:
>>> 
>>>> x
>>> 
>>> [1] "OYS-PIA2-FL-1"  "OYS-PIA2-LA-1"  "OYS-PI-LA-BB-1" "OYS-PIA2-LA-10"
>>>> 
>>>> sub("^.*?([0-9]+)$", "\\1", x)
>>> 
>>> [1] "1"  "1"  "1"  "10"
>>>> 
>>>> 
>>> 
>>> 
>> 
>> Steve;
>> 
>> jim holtman is one of the jewels of the rhelp world. I generally assume that
>> his answers are going to be the most succinct and efficient ones possible
>> and avoid adding noise, but here I thought I would try to improve. Thinking
>> there might be a string-splitting approach, I first tried (and discovered) a
>> not-so-great solution:
>> 
>> x <- c("OYS-PIA2-FL-1",  "OYS-PIA2-LA-1",  "OYS-PI-LA-BB-1",
>> "OYS-PIA2-LA-10")
>> sapply( strsplit(x, "-") , "[", 4)
>> [1] "1"  "1"  "BB" "10"
>> 
>> So then I asked myself if we could just "blank out" everything before the
>> last dash, finding what seemed to be a fairly economical solution and one
>> that does not require back-references:
>> 
>> sub( "^.+-" , "", x)
>> 
>> [1] "1"  "1"  "1"  "10"
>> 
>> If there were no digits after the last dash these approaches give different
>> results:
>> 
>> x <- c("OYS-PIA2-FL-1",  "OYS-PIA2-LA-1",  "OYS-PI-LA-BB-1",
>> "OYS-PIA2-LA-")
>> 
>> sub( "^.+-" , "", x)
>> 
>> [1] "1" "1" "1" ""
>> 
>> sub("^.*?([0-9]+)$", "\\1", x)
>> [1] "1"            "1"            "1"            "OYS-PIA2-LA-"
>> 
>> When a regex pattern does not match, sub and gsub return the input
>> string unchanged.
>> 
>> --
>> David.
>> 
>>> 
>>> On Tue, Dec 11, 2012 at 12:46 PM, Steven Ranney <steven.ranney at gmail.com>
>>> wrote:
>>>> 
>>>> OYS-PIA2-FL-1
>>>> OYS-PIA2-LA-1
>>>> OYS-PI-LA-BB-1
>>>> OYS-PIA2-LA-10
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>> 
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> 
>> David Winsemius, MD
>> Alameda, CA, USA
>> 

David Winsemius
Alameda, CA, USA



