[R] Regex engine types

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Jun 10 08:47:07 CEST 2006


?regex does describe this:

      A range of characters may be specified by giving the first and last
      characters, separated by a hyphen.  (Character ranges are
      interpreted in the collation order of the current locale.)

You did not tell us your locale, but based on your past questions I would
guess en_NZ.utf8.  In that locale the collation order is wWxXyYzZ, so the
range W-Z covers W, x, X, y, Y, z and Z but not w, which explains your
surprise.  (It seems the PCRE code does not use the same ordering in that
locale.)
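
As a small sketch of two ways to avoid depending on collation order (the
variable names are only illustrative): list the characters explicitly, or
use perl = TRUE, where, as far as your example suggests, ranges follow
character codes and so an upper-case range matches no lower-case letters.
How the default engine treats ranges depends on the locale and the R
version, so it is left out here.

```r
# An explicit character list has no range, so collation order cannot matter.
explicit_upper <- grep("[WXYZ]", LETTERS, value = TRUE)
explicit_lower <- grep("[WXYZ]", letters, value = TRUE)

# With perl = TRUE the range is compared by character code, so the
# upper-case range W-Z does not pick up any lower-case letters.
perl_upper <- grep("[W-Z]", LETTERS, value = TRUE, perl = TRUE)
perl_lower <- grep("[W-Z]", letters, value = TRUE, perl = TRUE)

print(explicit_upper)
print(perl_upper)
print(perl_lower)
```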

You may find it useful to set LC_COLLATE to C as I do:

> strsplit(Sys.getlocale(), ";")
[[1]]
  [1] "LC_CTYPE=en_GB"       "LC_NUMERIC=C"         "LC_TIME=en_GB"
  [4] "LC_COLLATE=C"         "LC_MONETARY=en_GB"    "LC_MESSAGES=en_GB"
  [7] "LC_PAPER=en_GB"       "LC_NAME=C"            "LC_ADDRESS=C"
[10] "LC_TELEPHONE=C"       "LC_MEASUREMENT=en_GB" "LC_IDENTIFICATION=C"
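
If you do not want to change your startup environment, a minimal sketch of
setting the collation order for the current session only is below.  It
assumes the "C" locale is available (it should be on any platform); the
variable names are just for illustration.

```r
# Remember the current collation setting so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch collation to the C locale: strings then compare by character code,
# so all upper-case ASCII letters sort before all lower-case ones.
Sys.setlocale("LC_COLLATE", "C")
sorted_c <- sort(c("b", "A", "a", "B"))
print(sorted_c)

# Restore the previous collation order.
Sys.setlocale("LC_COLLATE", old_collate)
```

Only LC_COLLATE is changed here, so character classification (LC_CTYPE)
and the other categories keep their existing settings.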


On Sat, 10 Jun 2006, Patrick Connolly wrote:

>> version
>         _
> platform x86_64-unknown-linux-gnu
> arch     x86_64
> os       linux-gnu
> system   x86_64, linux-gnu
> status
> major    2
> minor    2.1
> year     2005
> month    12
> day      20
> svn rev  36812
> language R
>>
>
>> grep("[W-Z]", LETTERS, value = TRUE)
> [1] "W" "X" "Y" "Z"
>
> That's what I'd have expected.
>
>> grep("[W-Z]", letters, value = TRUE)
> [1] "x" "y" "z"
>
> Not what I'd have thought.  However,
>
>> grep("[B-D]", letters, value = TRUE, perl = TRUE)
> character(0)
>
> So what is it that standard regular expressions use that's different
> from Perl-type ones?
>
> The help file for grep refers to POSIX 1003.2 which looked a bit
> daunting to delve into.  From my limited reading, it seems there are
> different regex "Engine Types" which seems to be getting somewhat
> tangential to what I was working on.  I could probably avoid problems
> if I always set perl=TRUE, but it would be good to know what basic and
> extended regular expressions do that's different.  If someone has a
> quick line or two describing it, I'd be interested to know.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


