[R] Extracting a numeric prefix from a string

Mike White mikewhite.diu at tiscali.co.uk
Wed Feb 2 10:50:00 CET 2005


Thanks for your contributions.  Jonne's solution (after sorting) works fine
for my purposes, but it would be useful to have a function that works for any
numeric prefix.  Another case to include would be a signed numeric:
x<-c("+12.3.abc", "-0.12xyz")
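
A minimal sketch of such a function follows, anchored at the start of the
string and allowing an optional sign, the forms "2", "2.", "2.2" and ".2",
and an optional exponent such as "e-7".  The name numeric_prefix and the
choice to return NA when there is no numeric prefix are assumptions made for
illustration; they are not taken from any of the replies.

```r
# Hypothetical helper (not from the thread): extract a leading signed
# decimal number, with an optional exponent, from each string in x.
numeric_prefix <- function(x) {
  # optional sign, then either digits with an optional fraction
  # ("2", "2.", "2.2") or a bare fraction (".2"), then an optional
  # exponent such as "e-7"; the dot is escaped so it matches literally
  pattern <- "^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?"
  m <- regexpr(pattern, x)
  # strings without a numeric prefix stay NA (an assumed convention)
  out <- rep(NA_character_, length(x))
  hit <- m != -1L
  out[hit] <- regmatches(x, m)
  as.numeric(out)
}

x <- c("+12.3.abc", "-0.12xyz", "2.2abc", "2.def", ".2ghi", ".jkl", "2a4")
numeric_prefix(x)
```

With this pattern "+12.3.abc" should give 12.3 and "-0.12xyz" should give
-0.12, while ".jkl" has no numeric prefix and comes back as NA.  Only the
matched text is passed to as.numeric(), so no coercion warning is expected.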

Mike
----- Original Message -----
From: "Peter Dalgaard" <p.dalgaard at biostat.ku.dk>
To: <ted.harding at nessie.mcc.ac.uk>
Cc: "R user" <R-user at zutt.org>; <R-help at stat.math.ethz.ch>; "Mike White"
<mikewhite.diu at tiscali.co.uk>
Sent: Monday, January 31, 2005 11:05 PM
Subject: Re: [R] Extracting a numeric prefix from a string


> (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:
>
> > On 31-Jan-05 R user wrote:
> > > You could use something like
> > >
> > > y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> > > as.numeric(y)
> > >
> > > But maybe there's a much nicer way.
> > >
> > > Jonne.
> >
> > I doubt it -- full marks for neat regexp footwork!
>
> Hmm, I'd have to deduct a few points for forgetting to escape the dot...
>
> > x <- "2a4"
> > y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> > y
> [1] "2a4"
> >  as.numeric(y)
> [1] NA
> Warning message:
> NAs introduced by coercion
>
> and maybe a few more for using gsub() where sub() suffices.
>
> There are a few more nits to pick, since "2.", ".2", "2e-7" are also
> numbers, but ".", ".e-2" are not. In fact it seems quite hard even to
> handle all cases in, e.g.,
>
>  x <- c("2.2abc","2.def",".2ghi",".jkl")
>
> with a single regular expression. The first one that worked for me was
>
> > r <- regexpr('^(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))',x)
> > substr(x,r,r+attr(r,"match.length")-1)
> [1] "2.2" "2."  ".2"  ""
>
> but several "obvious" attempts had failed.
>
> The problem is that regular expressions try to find the
> longest match, but not necessarily of subexpressions, so
>
> > sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))?.*','\\1',x)
> [1] "2." "2." ".2" ""
>
> even though
>
> > sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))','XXX',x)
> [1] "XXXabc" "XXXdef" "XXXghi" ".jkl"
>
> Actually, this one comes pretty close:
>
> > sub('([0-9]*(\\.[0-9]+)?)?.*','\\1',x)
> [1] "2.2" "2"   ".2"  ""
>
> It only loses a trailing dot which is immaterial in the present
> context. However, next try extending the RE to handle an exponent
> part...
>
> --
>    O__  ---- Peter Dalgaard             Blegdamsvej 3
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
>  (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
>



