[R] regular expression strikes again

Jan Kim jttkim at googlemail.com
Tue Jul 9 13:16:07 CEST 2013


On Tue, Jul 09, 2013 at 09:45:55AM +0000, PIKAL Petr wrote:
> Dear experts in regexpr.
> 
> I have this
> 
> dput(test[500:510])
> c("pH 9,36 2", "pH 9,36 3", "pH 9,66 1", "pH 9,66 2", "pH 9,66 3", 
> "pH 10,04 1", "pH 10,04 2", "pH 10,04 3", "RGLP 144006 pH 6,13 1", 
> "RGLP 144006 pH 6,13 2", "RGLP 144006 pH 6,13 3")
> 
> and I want something like this
> 
> gsub("^.*([[:digit:]],[[:digit:]]*).*$", "\\1", test[500:510])
>  [1] "9,36" "9,36" "9,66" "9,66" "9,66" "0,04" "0,04" "0,04" "6,13" "6,13"
> [11] "6,13"
> 
> but with 10,04 values instead of 0,04.
> 
> I tried
> gsub("^.*([[:digit:]]+,[[:digit:]]*).*$", "\\1", test[500:510])
> 
> or other variations but without any success.
> 
> Please help.

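The "1" in "10,04" is consumed by the leading ".*", which is greedy: it
takes as much as it can and leaves the group only the single digit it
needs before the comma. That is also why changing "[[:digit:]]" to
"[[:digit:]]+" makes no difference. In your example, all the
decimal-comma numbers you're trying to extract are preceded by "pH ",
so replacing ".*" with ".*pH " should do what you want.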
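
To illustrate, here is a minimal sketch on a three-element subset of
your data (the short vector and the variable names are mine, chosen
only for the example). It shows the original pattern next to the one
anchored on "pH ":

```r
# Small stand-in for test[500:510]
test <- c("pH 9,36 2", "pH 10,04 1", "RGLP 144006 pH 6,13 1")

# Original pattern: the leading ".*" takes the "1" of "10,04"
old <- gsub("^.*([[:digit:]],[[:digit:]]*).*$", "\\1", test)
old
# [1] "9,36" "0,04" "6,13"

# Anchored on "pH ": the number now has to start right after it
ph <- gsub("^.*pH ([[:digit:]]+,[[:digit:]]*).*$", "\\1", test)
ph
# [1] "9,36"  "10,04" "6,13"

# If you need numbers rather than strings, swap the comma for a point
ph_num <- as.numeric(sub(",", ".", ph, fixed = TRUE))
```

Note that gsub() returns an element unchanged when the pattern does not
match it, so any string without "pH " would pass through as is.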

I'd be wary about that variation of having "RGLP 144006" in some
cases, though; it might be better to clean up this rubbish earlier
on (and it would be ideal to never have it generated in the first
place). Regular expressions can be useful to separate some chaff
from the wheat, but relying on that too much comes with a risk of
extracting something that is valid in some syntactic / technical
sense but not correct semantically. If you can't be 100% certain
that the number you want is (1) always preceded by "pH ", (2)
always a floating comma number and (3) will always contain an
integer and a fractional part (i.e. you'll never get ",09" rather
than "0,09", or "10" rather than "10,0"), you have to be prepared
for more difficulties, and you may want to consider a more systematic
approach to parsing your input.
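
As one possible way to guard against these cases, the sketch below
extracts the value with regexpr() and regmatches() and returns NA for
every string that does not satisfy all three conditions, so that
problem entries show up instead of slipping through. The helper name
extract_ph and the sample strings are hypothetical, made up for this
example; adapt the pattern to whatever your real input allows.

```r
# Hypothetical helper, not part of base R or of the original code
extract_ph <- function(x) {
  # Require "pH ", an integer part, a comma and a fractional part
  pattern <- "pH [[:digit:]]+,[[:digit:]]+"
  m <- regexpr(pattern, x)

  # Start with NA everywhere, fill in only the elements that matched
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)

  # Drop the "pH " prefix and convert decimal comma to decimal point
  out <- sub("^pH ", "", out)
  as.numeric(sub(",", ".", out, fixed = TRUE))
}

# Made-up input, including the two awkward cases mentioned above
samples <- c("pH 10,04 1", "RGLP 144006 pH 6,13 2", "pH ,09 1", "pH 10 3")
res <- extract_ph(samples)

# Indices of entries that need a closer look
bad <- which(is.na(res))
```

Here the first two strings yield 10.04 and 6.13, while ",09" and "10"
come back as NA and are listed in bad, so you can decide explicitly
how to handle them.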

Best regards, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*


