[R] unexpected behaviour of R-2.10.1 regular expression in UTF-8 locale

axel.klenk at actelion.com axel.klenk at actelion.com
Thu Jan 21 17:36:44 CET 2010



Dear R-helpers,

I have encountered the following unexpected behaviour of R-2.10.1, but
not R-2.9.0, on both RHEL 4 and Ubuntu Karmic (precompiled via synaptic
or built from source).

I have a character vector from which I want to extract a certain pattern
that is surrounded by junk, as in:

> nn <- sprintf("junk_%02d_junk", 1:2)
> nn
[1] "junk_01_junk" "junk_02_junk"

> sub("^.*([[:digit:]]{2}).*$", "\\1", nn)
[1] "nk" "nk"
# oops? however:

> sub("^.*([[:digit:]]{2}).*$", "\\1", nn, perl = TRUE)
[1] "01" "02"

# as expected, and also

> Sys.setlocale("LC_ALL", "C")
[1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> sub("^.*([[:digit:]]{2}).*$", "\\1", nn)
[1] "01" "02"

Is there something wrong with my regex syntax, or am I missing something
else? Obviously I have at least two workarounds, but I'd like to report
this since it is breaking code that used to run in R-2.9.0.
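
For reference, the two workarounds visible in the transcript above (the
PCRE engine and the "C" locale) could be wrapped up roughly as follows.
This is only a sketch, not a fix for the underlying behaviour. The helper
name extract_in_c_locale is hypothetical, and it assumes that LC_CTYPE is
the category that matters (the session above switched LC_ALL). The third
variant, regexpr() plus substring(), is an extra alternative that does not
appear in the session; it returns the first two-digit run, whereas the
greedy pattern returns the last, which is the same thing for this data.

```r
nn <- sprintf("junk_%02d_junk", 1:2)
pattern <- "^.*([[:digit:]]{2}).*$"

# Workaround 1: ask for the PCRE engine, as in the transcript above.
w1 <- sub(pattern, "\\1", nn, perl = TRUE)

# Workaround 2: run the same call with LC_CTYPE temporarily set to "C",
# restoring the previous setting on exit.
# (Hypothetical helper; assumes LC_CTYPE is the relevant category.)
extract_in_c_locale <- function(x, pattern) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old))
  Sys.setlocale("LC_CTYPE", "C")
  sub(pattern, "\\1", x)
}
w2 <- extract_in_c_locale(nn, pattern)

# Alternative (not from the session above): locate the first two-digit
# run with regexpr() and cut it out with substring(), avoiding the
# greedy ".*" altogether.
m <- regexpr("[[:digit:]]{2}", nn, perl = TRUE)
w3 <- substring(nn, m, m + attr(m, "match.length") - 1)

print(w1)
print(w2)
print(w3)
```

Comparing w1, w2 and w3 against the plain sub() call in the affected
locale should make it easy to see which code paths are hit by the problem.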

Thanks in advance for any help or insight,

 - axel



$ R --vanilla

R version 2.10.1 (2009-12-14)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
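
Since the behaviour depends on the locale, it may help to record whether
the session is actually running in a multibyte UTF-8 locale alongside the
sessionInfo() output. A small sketch using Sys.getlocale() and
l10n_info() from base R (the variable names are illustrative only):

```r
# Report the character-type locale and whether R considers it UTF-8.
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()

cat("LC_CTYPE:     ", ctype, "\n")
cat("UTF-8 locale: ", info[["UTF-8"]], "\n")
cat("Multibyte:    ", info[["MBCS"]], "\n")
```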



Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil /
Switzerland





