[R] Regex question to find a string that contains 5-9 alpha-numeric characters, at least one of which is a number

Greg Snow Greg.Snow at imail.org
Tue Jun 9 18:26:50 CEST 2009


Here is one way using a single pattern (so it can be used in a substitution). It uses Perl's positive lookahead patterns:

> test <- c("SHRT","5HRT","M1TCH","M1TCH5","LONG3RS","NONUMBER","TOOLOOOONGG","ooops.3")
> 
> sub( '(?=[a-zA-Z]{0,8}[0-9])[a-zA-Z0-9]{5,9}', 'xxx', test, perl=TRUE)
[1] "SHRT"        "5HRT"        "xxx"         "xxx"         "xxx"        
[6] "NONUMBER"    "TOOLOOOONGG" "ooops.3"    
>
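
A note on how the pattern works, with a hedged variant. The lookahead `(?=[a-zA-Z]{0,8}[0-9])` checks, without consuming any characters, that a digit appears within the next 9 characters with only letters before it; `[a-zA-Z0-9]{5,9}` then does the actual matching. Because the pattern is unanchored, it can also match inside a longer alphanumeric run. The sketch below assumes the intent is to match whole strings only; the `anchored` pattern and the extra test value "TOOLOOOONGG1" are illustrative additions, not part of the original example.

```r
test <- c("SHRT", "5HRT", "M1TCH", "M1TCH5", "LONG3RS",
          "NONUMBER", "TOOLOOOONGG", "ooops.3", "TOOLOOOONGG1")

# Pattern from the example above: no anchors, so it may match a
# 5-9 character piece of a longer alphanumeric string.
unanchored <- "(?=[a-zA-Z]{0,8}[0-9])[a-zA-Z0-9]{5,9}"

# Assumed variant: ^ and $ force the whole string to be 5-9
# alphanumerics; the lookahead requires a digit somewhere in it.
anchored <- "^(?=[a-zA-Z]*[0-9])[a-zA-Z0-9]{5,9}$"

res_unanchored <- sub(unanchored, "xxx", test, perl = TRUE)
res_anchored   <- sub(anchored,   "xxx", test, perl = TRUE)

# The last element shows the difference between the two patterns.
res_unanchored[9]  # "TOOxxx"
res_anchored[9]    # "TOOLOOOONGG1"
```

If the target strings are embedded in longer text rather than being whole vector elements, the anchors would not apply and some other delimiter (for example word boundaries) would be needed instead.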

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Marc Schwartz
> Sent: Monday, June 08, 2009 6:33 PM
> To: Barry Rowlingson
> Cc: r-help at r-project.org; Tan, Richard
> Subject: Re: [R] Regex question to find a string that contains 5-9
> alpha-numeric characters, at least one of which is a number
> 
> 
> On Jun 8, 2009, at 5:27 PM, Barry Rowlingson wrote:
> 
> > On Mon, Jun 8, 2009 at 10:40 PM, Tan, Richard <RTan at panagora.com> wrote:
> >> Hi,
> >>
> >> This is not exactly an R question but I am trying to use gsub to
> >> replace a string that contains 5-9 alpha-numeric characters, at
> >> least one of which is a number.  Is there a good way to write it in
> >> a one line regex?
> >
> > The only way I can think of is to spell out all the possible
> > expressions, something like:
> >
> > [0-9][a-z0-9]{4} | [a-z0-9][0-9][a-z0-9]{3} |
> > [a-z0-9]{2}[0-9][a-z0-9]{2} .... and so on. That is, have a regex
> > component for every possible 5, 6, 7, 8, and 9 character expression
> > with [0-9] in each place. I'm not sure this qualifies as 'good',
> > though..
> >
> > Better to do it in two stages, one to check for 5-9 alphanumerics,
> > and then another to check for a number.
> >
> > Here's something on a test vector 's':
> >
> >> cbind(s,grepl("^[A-Z0-9]{5,9}$",s),grepl("[0-9]",s))
> >     s
> > [1,] "SHRT"        "FALSE" "FALSE"
> > [2,] "5HRT"        "FALSE" "TRUE"
> > [3,] "M1TCH"       "TRUE"  "TRUE"
> > [4,] "M1TCH5"      "TRUE"  "TRUE"
> > [5,] "LONG3RS"     "TRUE"  "TRUE"
> > [6,] "NONUMBER"    "TRUE"  "FALSE"
> > [7,] "TOOLOOOONGG" "FALSE" "FALSE"
> >
> > The ones you want give two TRUE values. Extending to lower-case is
> > left as an exercise...
> >
> > Barry
> 
> 
> I was trying to think of a way to do this with only a single grep(),
> but it has been too long of a day.
> 
> So here is a bit of a simplification on the two stage approach:
> 
>  > vec
> [1] "SHRT"        "5HRT"        "M1TCH"       "M1TCH5"      "LONG3RS"
> [6] "NONUMBER"    "TOOLOOOONGG"
> 
> 
>  > grep("[0-9]", vec[grep("^[[:alnum:]]{5,9}$", vec)], value = TRUE)
> [1] "M1TCH"   "M1TCH5"  "LONG3RS"
> 
> 
> HTH,
> 
> Marc Schwartz
> 
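
The two-stage approach quoted above can also be written as one logical index, which makes it usable for replacement as well as selection. This is a minimal sketch, not code from the thread: the names `is_len_ok`, `has_digit` and `keep`, and the lower-case test value "m1tch", are assumptions added for illustration. Note that `[[:alnum:]]` is locale-dependent and may accept non-ASCII letters; `[A-Za-z0-9]` is the stricter choice.

```r
vec <- c("SHRT", "5HRT", "M1TCH", "M1TCH5", "LONG3RS",
         "NONUMBER", "TOOLOOOONGG", "m1tch")

# Stage 1: the whole string is 5-9 alphanumeric characters
# (upper or lower case).
is_len_ok <- grepl("^[[:alnum:]]{5,9}$", vec)

# Stage 2: the string contains at least one digit.
has_digit <- grepl("[0-9]", vec)

# Both conditions must hold.
keep <- is_len_ok & has_digit

vec[keep]                    # select the matching strings
replace(vec, keep, "xxx")    # or replace them in place
```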