[R] Extracting a numeric prefix from a string

McGehee, Robert Robert.McGehee at geodecapital.com
Tue Feb 1 00:49:38 CET 2005


Perhaps an easier way would be to throw away the offending text at the
end of the strings, rather than matching all possible numeric
formulations at the beginning of the string, that is:

sub("\\.*[[:alpha:]]+$", "", x)

Easier to read, if nothing else, and it allows for 2e-7 as a valid
number. This, however, assumes (I think correctly) that there aren't
numbers in the middle of the string, e.g. 2a3b.
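
A minimal sketch of that suggestion applied to a small vector. The test
strings are made up for illustration (the first four are borrowed from
the quoted message below, "2e-7xyz" is an extra case), and the names
`stripped` and `values` are hypothetical:

```r
# Illustrative inputs; "2e-7xyz" checks that an exponent survives.
x <- c("2.2abc", "2.def", ".2ghi", ".jkl", "2e-7xyz")

# Drop any run of dots followed by letters at the end of each string.
stripped <- sub("\\.*[[:alpha:]]+$", "", x)
stripped
# [1] "2.2"  "2"    ".2"   ""     "2e-7"

# The empty string becomes NA when converted.
values <- as.numeric(stripped)
```

Note that only the trailing letters go away, so something like "2a3b"
would still leave "2a3" behind, which is the assumption mentioned above.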

Robert

-----Original Message-----
From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk] 
Sent: Monday, January 31, 2005 6:05 PM
To: ted.harding at nessie.mcc.ac.uk
Cc: R user; R-help at stat.math.ethz.ch; Mike White
Subject: Re: [R] Extracting a numeric prefix from a string


(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:

> On 31-Jan-05 R user wrote:
> > You could use something like
> > 
> > y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> > as.numeric(y)
> > 
> > But maybe there's a much nicer way.
> > 
> > Jonne.
> 
> I doubt it -- full marks for neat regexp footwork!

Hmm, I'd have to deduct a few points for forgetting to escape the dot...

> x <- "2a4"
> y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> y
[1] "2a4"
>  as.numeric(y)
[1] NA
Warning message:
NAs introduced by coercion

and maybe a few more for using gsub() where sub() suffices.
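
For comparison, a sketch of the same idea with both of those points
addressed, i.e. the dot escaped and sub() instead of gsub(). The input
vector is made up for illustration:

```r
# Illustrative inputs, including the "2a4" case that failed above.
x <- c("2a4", "2.5kg", "17", "abc")

# "\\." now matches only a literal dot, so "a4" can no longer be
# swallowed as a fractional part; sub() is enough because the
# pattern consumes the whole string in a single match.
y <- sub("([0-9]+(\\.[0-9]+)?)?.*", "\\1", x)
y
# [1] "2"   "2.5" "17"  ""

as.numeric(y)
```

This still only covers the digits-dot-digits form, so the nits below
apply to it as well.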

There are a few more nits to pick, since "2.", ".2", "2e-7" are also
numbers, but ".", ".e-2" are not. In fact it seems quite hard even to
handle all cases in, e.g.,

 x <- c("2.2abc","2.def",".2ghi",".jkl")

with a single regular expression. The first one that worked for me was

> r <- regexpr('^(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))',x)
> substr(x,r,r+attr(r,"match.length")-1)
[1] "2.2" "2."  ".2"  ""

but several "obvious" attempts had failed.
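
One possible variation, offered as a sketch rather than as the pattern
used above: folding the three alternatives into two, so that each input
can be matched by only one branch and the result does not depend on
which alternative the engine prefers. The names `pat` and `prefix` are
hypothetical:

```r
x <- c("2.2abc", "2.def", ".2ghi", ".jkl")

# Either digits with an optional dot and optional further digits,
# or a dot followed by at least one digit.
pat <- "^([0-9]+\\.?[0-9]*|\\.[0-9]+)"

r <- regexpr(pat, x)
# Where nothing matches, regexpr() gives -1 for both the position and
# the match length, so substr() returns an empty string.
prefix <- substr(x, r, r + attr(r, "match.length") - 1)
prefix
# [1] "2.2" "2."  ".2"  ""
```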

The problem is that regular expressions try to find the
longest match, but not necessarily of subexpressions, so

> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))?.*','\\1',x)
[1] "2." "2." ".2" ""

even though

> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))','XXX',x)
[1] "XXXabc" "XXXdef" "XXXghi" ".jkl"

Actually, this one comes pretty close:

> sub('([0-9]*(\\.[0-9]+)?)?.*','\\1',x)
[1] "2.2" "2"   ".2"  ""

It only loses a trailing dot which is immaterial in the present
context. However, next try extending the RE to handle an exponent
part... 
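
A rough sketch of what such an extension might look like, using the
regexpr()/substr() idiom from above. The function name `numeric_prefix`
and the test strings are hypothetical, and the pattern is only one
possible formulation: it assumes an exponent is "e" or "E", an optional
sign, and at least one digit, and it does not handle a leading sign.

```r
# Hypothetical helper: numeric prefix of each string, or NA if none.
numeric_prefix <- function(x) {
  # Mantissa: digits with optional dot and digits, or dot plus digits.
  # Exponent (optional): e/E, optional sign, one or more digits.
  pat <- "^([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][-+]?[0-9]+)?"
  r <- regexpr(pat, x)
  # No match yields "" from substr(), which as.numeric() turns into NA.
  as.numeric(substr(x, r, r + attr(r, "match.length") - 1))
}

res <- numeric_prefix(c("2.2abc", "2.def", ".2ghi", ".jkl",
                        "2e-7xyz", ".e-2", "3e"))
```

With these inputs, ".jkl" and ".e-2" have no numeric prefix and come
back as NA, while "3e" falls back to 3 because the incomplete exponent
is simply not matched.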

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
