[R] Extracting a numeric prefix from a string

Peter Dalgaard p.dalgaard at biostat.ku.dk
Tue Feb 1 00:05:28 CET 2005


(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:

> On 31-Jan-05 R user wrote:
> > You could use something like
> > 
> > y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> > as.numeric(y)
> > 
> > But maybe there's a much nicer way.
> > 
> > Jonne.
> 
> I doubt it -- full marks for neat regexp footwork!

Hmm, I'd have to deduct a few points for forgetting to escape the dot...

> x <- "2a4"
> y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> y
[1] "2a4"
>  as.numeric(y)
[1] NA
Warning message:
NAs introduced by coercion

and maybe a few more for using gsub() where sub() suffices.
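
A minimal sketch of the two fixes just mentioned (escaping the dot, and sub() in place of gsub()). The anchors and perl = TRUE are assumptions added here to make the matching behaviour predictable; they are not part of the quoted suggestion, and the input vector is made up for illustration.

```r
# Hypothetical inputs: a digit-letter-digit string, a decimal with a unit,
# and a string with no numeric prefix at all.
x <- c("2a4", "2.5kg", "abc")

# "\\." matches a literal dot only, so "a" no longer passes for a decimal point.
# sub() is enough because the anchored pattern consumes the whole string once.
y <- sub("^([0-9]+(\\.[0-9]+)?)?.*$", "\\1", x, perl = TRUE)

y
as.numeric(y)   # the empty string converts to NA
```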

There are a few more nits to pick, since "2.", ".2", "2e-7" are also
numbers, but ".", ".e-2" are not. In fact it seems quite hard even to
handle all cases in, e.g.,

 x <- c("2.2abc","2.def",".2ghi",".jkl")

with a single regular expression. The first one that worked for me was

> r <- regexpr('^(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))',x)
> substr(x,r,r+attr(r,"match.length")-1)
[1] "2.2" "2."  ".2"  ""

but several "obvious" attempts had failed.

The problem is that regular expressions try to find the longest
overall match, but not necessarily the longest match for each
subexpression, so

> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))?.*','\\1',x)
[1] "2." "2." ".2" ""

even though

> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))','XXX',x)
[1] "XXXabc" "XXXdef" "XXXghi" ".jkl"
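
One possible way around this, sketched under the assumption that perl = TRUE is acceptable: Perl-style matching tries the alternatives from left to right and keeps the first one that lets the whole pattern succeed, so listing the most specific alternative first should give the captured group the full prefix. The reordering is an illustration added here, not one of the attempts described in this post.

```r
x <- c("2.2abc", "2.def", ".2ghi", ".jkl")

# Alternatives ordered from most to least specific:
#   digits.digits, then digits with an optional trailing dot, then .digits
pat <- "^(([0-9]+\\.[0-9]+)|([0-9]+\\.?)|(\\.[0-9]+))?.*$"

res <- sub(pat, "\\1", x, perl = TRUE)
res
```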

Actually, this one comes pretty close:

> sub('([0-9]*(\\.[0-9]+)?)?.*','\\1',x)
[1] "2.2" "2"   ".2"  ""

It only loses a trailing dot which is immaterial in the present
context. However, next try extending the RE to handle an exponent
part... 
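
A hedged sketch of one way the exponent part might be handled, reusing the regexpr()/substr() approach from above. The function name num_prefix, the use of perl = TRUE, and the decision to return NA when there is no numeric prefix are all assumptions made for illustration; signs are deliberately left out.

```r
# num_prefix() is a hypothetical helper, not an existing R function.
num_prefix <- function(x) {
  # mantissa: digits with optional ".digits", or ".digits";
  # followed by an optional exponent such as e-7 or E+3
  pat <- "^([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?"
  m <- regexpr(pat, x, perl = TRUE)
  prefix <- substr(x, m, m + attr(m, "match.length") - 1)
  prefix[which(m == -1)] <- NA_character_   # no numeric prefix found
  as.numeric(prefix)
}

x <- c("2.2abc", "2.def", ".2ghi", ".jkl", "2e-7xyz", ".e-2")
num_prefix(x)   # values: 2.2, 2, 0.2, NA, 2e-07, NA
```

Because the exponent group is optional as a whole, an incomplete exponent (for example a bare "e" in "2elephants") is simply not matched and the mantissa alone is returned.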

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907



