[R] Regex engine types

Sat Jun 10 14:55:28 CEST 2006

I get the same result under a US English collation order:

> strsplit(Sys.getlocale(), ";")
[[1]]
[1] "LC_COLLATE=English_United States.1252"
[2] "LC_CTYPE=English_United States.1252"
[3] "LC_MONETARY=English_United States.1252"
[4] "LC_NUMERIC=C"
[5] "LC_TIME=English_United States.1252"

> grep("[W-Z]", letters, value = TRUE)
[1] "x" "y" "z"
> R.version.string # Windows XP
[1] "Version 2.3.1 Patched (2006-06-04 r38279)"
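
For anyone who wants a pattern that does not depend on LC_COLLATE, here is a minimal sketch of three workarounds: listing the characters explicitly, using perl = TRUE, and temporarily switching the collation order to C. The variable names are mine and purely illustrative, and I am assuming the saved locale string can be restored on your platform; the result of the third approach depends on the regex engine and R version, so no output is shown for it.

```r
# Sketch: three ways to avoid locale-dependent character ranges.
# Variable names are illustrative only.

# 1. List the characters explicitly instead of using a range.
explicit_lower <- grep("[WXYZ]", letters, value = TRUE)  # character(0)
explicit_upper <- grep("[WXYZ]", LETTERS, value = TRUE)  # "W" "X" "Y" "Z"

# 2. Use the Perl-compatible engine, which did not follow the
#    locale's collation order in the examples in this thread.
perl_lower <- grep("[W-Z]", letters, value = TRUE, perl = TRUE)  # character(0)

# 3. Temporarily set the collation order to C, then restore it.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
c_lower <- grep("[W-Z]", letters, value = TRUE)
invisible(Sys.setlocale("LC_COLLATE", old_collate))

print(explicit_lower)
print(explicit_upper)
print(perl_lower)
print(c_lower)
```

The explicit list is the most portable of the three, since it relies on neither the locale nor the choice of engine.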

On 6/10/06, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> ?regex does describe this:
>
>      A range of characters may be specified by giving the first and last
>      characters, separated by a hyphen.  (Character ranges are
>      interpreted in the collation order of the current locale.)
>
> You did not tell us your locale, but based on questions from you in the
> past I would guess en_NZ.utf8.  In that locale the collation order is
> wWxXyYzZ, so your surprise is explained.  (It seems the PCRE code is not
> using the same ordering in that locale.)
>
> You may find it useful to set LC_COLLATE to C as I do:
>
> > strsplit(Sys.getlocale(), ";")
> [[1]]
>  [1] "LC_CTYPE=en_GB"       "LC_NUMERIC=C"         "LC_TIME=en_GB"
>  [4] "LC_COLLATE=C"         "LC_MONETARY=en_GB"    "LC_MESSAGES=en_GB"
>  [7] "LC_PAPER=en_GB"       "LC_NAME=C"            "LC_ADDRESS=C"
> [10] "LC_TELEPHONE=C"       "LC_MEASUREMENT=en_GB" "LC_IDENTIFICATION=C"
>
>
> On Sat, 10 Jun 2006, Patrick Connolly wrote:
>
> >> version
> >         _
> > platform x86_64-unknown-linux-gnu
> > arch     x86_64
> > os       linux-gnu
> > system   x86_64, linux-gnu
> > status
> > major    2
> > minor    2.1
> > year     2005
> > month    12
> > day      20
> > svn rev  36812
> > language R
> >>
> >
> >> grep("[W-Z]", LETTERS, value = TRUE)
> > [1] "W" "X" "Y" "Z"
> >
> > That's what I'd have expected.
> >
> >> grep("[W-Z]", letters, value = TRUE)
> > [1] "x" "y" "z"
> >
> > Not what I'd have thought.  However,
> >
> >> grep("[B-D]", letters, value = TRUE, perl = TRUE)
> > character(0)
> >
> > So what is it that standard regular expressions use that's different
> > from Perl-type ones?
> >
> > The help file for grep refers to POSIX 1003.2 which looked a bit
> > daunting to delve into.  From my limited reading, it seems there are
> > different regex "Engine Types", which is getting somewhat
> > tangential to what I was working on.  I could probably avoid problems
> > if I always set perl=TRUE, but it would be good to know what basic and
> > extended regular expressions do that's different.  If someone has a
> > quick line or two describing it, I'd be interested to know.
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595