| icuSetCollate {base} | R Documentation | 
Setup Collation by ICU
Description
Controls the way collation is done by ICU (an optional part of the R build).
Usage
icuSetCollate(...)
icuGetCollate(type = c("actual", "valid"))
Arguments
| ... | named arguments, see ‘Details’. | 
| type | a character string: either "actual" or "valid". | 
Details
Optionally, R can be built to collate character strings by ICU
(https://icu.unicode.org/).  For such systems,
icuSetCollate can be used to tune the way collation is done.
On other builds calling this function does nothing, with a warning.
Possible arguments are
- locale:
- A character string such as "da_DK" giving the language and country whose collation rules are to be used. If present, this should be the first argument.
- case_first:
- "upper", "lower" or "default", asking for upper- or lower-case characters to be sorted first. The default is usually lower-case first, but not in all languages (not under the default settings for Danish, for example).
- alternate_handling:
- Controls the handling of ‘variable’ characters (mainly punctuation and symbols). Possible values are "non_ignorable" (primary strength) and "shifted" (quaternary strength).
- strength:
- Which components should be used? Possible values are "primary", "secondary", "tertiary" (default), "quaternary" and "identical".
- french_collation:
- In a French locale the way accents affect collation is from right to left, whereas in most other locales it is from left to right. Possible values are "on", "off" and "default".
- normalization:
- Should strings be normalized? Possible values are "on" and "off" (default). This affects the collation of composite characters.
- case_level:
- An additional level between secondary and tertiary, used to distinguish large and small Japanese Kana characters. Possible values are "on" and "off" (default).
- hiragana_quaternary:
- Possible values are "on" (sort Hiragana first at quaternary level) and "off".
Only the first three are likely to be of interest except to those with a detailed understanding of collation and specialized requirements.
Some special values are accepted for locale:
- "none":
- ICU is not used for collation: the OS's collation services are used instead. 
- "ASCII":
- ICU is not used for collation: the C function strcmp is used instead, which should sort byte-by-byte in (unsigned) numerical order.
- "default":
- Obtains the locale from the OS as is done at the start of the session (except on Windows). If environment variable R_ICU_LOCALE is set to a non-empty value, its value is used rather than consulting the OS, unless environment variable LC_ALL is set to 'C' (or unset but LC_COLLATE is set to 'C').
- "", "root":
- The ‘root’ collation: see https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation.
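A minimal sketch of switching between these special values, guarded so that nothing is attempted on builds without ICU. The vector x is made up for illustration, and the resulting orders depend on the platform and locale, so no output is shown.

```r
## Sketch: compare the special 'locale' values (illustrative data only).
x <- c("banana", "Apple", "cherry", "apple")
if (capabilities("ICU")) {
    icuSetCollate(locale = "ASCII")    # byte-wise comparison via strcmp
    ascii_order <- sort(x)
    icuSetCollate(locale = "root")     # ICU 'root' collation
    root_order <- sort(x)
    icuSetCollate(locale = "none")     # hand collation back to the OS
    os_order <- sort(x)
    icuSetCollate(locale = "default")  # re-read the locale from the OS
    print(list(ASCII = ascii_order, root = root_order, OS = os_order))
}
```

The final call with locale = "default" is there so that the session is not left in one of the special modes.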
For the specifications of ‘real’ ICU locales, see
https://unicode-org.github.io/icu/userguide/locale/.  Note that ICU does not
report that a locale is not supported, but falls back to its idea of
‘best fit’ (which could be rather different and is reported by
icuGetCollate("actual"), often "root").  Most English
locales fall back to "root" as although e.g. "en_GB" is
a valid locale (at least on some platforms), it contains no special
rules for collation.  Note that "C" is not a supported ICU locale
and hence R_ICU_LOCALE should never be set to "C".
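Because a fallback is silent, it can help to query the collator after each request. The following is a sketch only; the variable names are illustrative and the reported values depend on the ICU build.

```r
## Sketch: check which locale ICU settled on after a request.
if (capabilities("ICU")) {
    icuSetCollate(locale = "en_GB")
    valid_locale  <- icuGetCollate("valid")
    actual_locale <- icuGetCollate("actual")  # often "root" for English locales
    cat("valid: ", valid_locale, "\nactual:", actual_locale, "\n")
    icuSetCollate(locale = "default")         # re-read the locale from the OS
}
```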
Some examples are case_level = "on", strength = "primary" to ignore
accent differences and alternate_handling = "shifted" to ignore
space and punctuation characters.
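The settings just mentioned could be tried as in the sketch below. The words vector is invented for illustration, the accented characters are written as \u escapes so that the code does not depend on the source encoding, and no output is shown as it may vary between ICU versions.

```r
## Sketch: loosen comparisons with 'strength' and 'alternate_handling'.
words <- c("resume", "r\u00e9sum\u00e9", "co-op", "coop")  # illustrative data
if (capabilities("ICU")) {
    ## Ignore accent differences.
    icuSetCollate(locale = "root", strength = "primary")
    print(sort(words))
    ## Ignore space and punctuation; giving 'locale' again resets 'strength'.
    icuSetCollate(locale = "root", alternate_handling = "shifted")
    print(sort(words))
    icuSetCollate(locale = "default")  # re-read the locale from the OS
}
```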
Initially ICU will not be used for collation if the OS is set to use the
C locale for collation and R_ICU_LOCALE is not set.  Once
this function is called with a value for locale, ICU will be used
until it is called again with locale = "none".  ICU will not be
used once Sys.setlocale is called with a "C" value for
LC_ALL or LC_COLLATE, even if R_ICU_LOCALE is set. 
ICU will be used again honoring R_ICU_LOCALE once
Sys.setlocale is called to set a different collation order. 
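The interaction with Sys.setlocale can be observed with a sketch along the following lines. It temporarily changes the collation category of the session and then restores the previous value; the strings returned by icuGetCollate depend on the platform.

```r
## Sketch: ICU is bypassed while the collation category is "C".
if (capabilities("ICU")) {
    old <- Sys.getlocale("LC_COLLATE")
    Sys.setlocale("LC_COLLATE", "C")
    status_in_C <- icuGetCollate()       # may be reported as "ICU not in use"
    Sys.setlocale("LC_COLLATE", old)     # switch back to the previous collation order
    status_restored <- icuGetCollate()
    cat("in C locale:", status_in_C, "\nrestored:   ", status_restored, "\n")
}
```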
Environment variables LC_ALL (or LC_COLLATE) take precedence
over R_ICU_LOCALE if and only if they are set to 'C'.  Due to the
interaction with other ways of setting the collation order,
R_ICU_LOCALE should be used with care and only when needed.
All customizations are reset to the default for the locale if
locale is specified; the collation engine is also reset if the
OS collation locale category is changed by Sys.setlocale.
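A sketch of the reset behaviour: a customization made in one call is dropped as soon as locale is given again. The data are illustrative and the resulting orders are not shown.

```r
## Sketch: specifying 'locale' discards earlier customizations.
x <- c("b", "A", "a", "B")             # illustrative data
if (capabilities("ICU")) {
    icuSetCollate(locale = "root", case_first = "upper")
    customized <- sort(x)              # sorted with upper case first
    icuSetCollate(locale = "root")     # 'case_first' returns to the locale default
    after_reset <- sort(x)
    print(list(customized = customized, after_reset = after_reset))
    icuSetCollate(locale = "default")  # re-read the locale from the OS
}
```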
Value
For icuGetCollate, a character string describing the ICU locale
in use (which may be reported as "ICU not in use").  The
‘actual’ locale may be simpler than the requested locale: for
example "da" rather than "da_DK": English locales are
likely to report "root".
Note
Except on Windows, ICU is used by default wherever it is available. As it works internally in UTF-8, it will be most efficient in UTF-8 locales.
On Windows, R is normally built including ICU, but it will only be
used if environment variable R_ICU_LOCALE was set when R
was started, or after icuSetCollate is called to select the
locale (as ICU and Windows differ in their idea of locale names).
Note that icuSetCollate(locale = "default") should work
reasonably well, but finds the system default ignoring environment
variables such as LC_COLLATE.
See Also
capabilities for whether ICU is available;
extSoftVersion for its version.
The ICU user guide chapter on collation (https://unicode-org.github.io/icu/userguide/collation/).
Examples
## These examples depend on having ICU available, and on the locale.
## As we don't know the current settings, we can only reset to the default.
if(capabilities("ICU")) withAutoprint({
    icuGetCollate()
    icuGetCollate("valid")
    x <- c("Aarhus", "aarhus", "safe", "test", "Zoo")
    sort(x)
    icuSetCollate(case_first = "upper"); sort(x)
    icuSetCollate(case_first = "lower"); sort(x)
    ## Danish collates upper-case-first and with 'aa' as a single letter
    icuSetCollate(locale = "da_DK", case_first = "default"); sort(x) 
    ## Estonian collates Z between S and T
    icuSetCollate(locale = "et_EE"); sort(x)
    icuSetCollate(locale = "default"); icuGetCollate("valid")
})