[R] sort() depends on locale

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Jun 15 13:33:07 CEST 2014


On 15/06/2014 12:16, Duncan Murdoch wrote:
> On 15/06/2014, 1:15 AM, Marius Hofert wrote:
>> Hi,
>>
>> If I use invisible(Sys.setlocale("LC_COLLATE", "C")) in ~/.Rprofile, then
>>
>>> sort(c("L.Y", "Lu", "L.Q"))
>> [1] "L.Q" "L.Y" "Lu"
>>
>> whereas using invisible(Sys.setlocale("LC_COLLATE", "en_US.UTF-8")) results in
>>
>>> sort(c("L.Y", "Lu", "L.Q"))
>> [1] "L.Q" "Lu"  "L.Y"
>>
>> I know this issue has appeared already
>> (https://stat.ethz.ch/pipermail/r-help//2012-February/304089.html), I
>> just don't see a reason for the second output: either '.' comes before
>> letters, then the result should be
>> "L.Q" "L.Y" "Lu" or it comes afterwards, then it should be "Lu" "L.Q"
>> "L.Y" -- the above result thus seems inconsistent with any useful notion
>> of 'sort' (?)
>
> I don't see this either, but it appears that on your platform the "." is
> simply being ignored, which might be a useful kind of sorting in some
> contexts.

ICU implements that behaviour (punctuation ignored when collating) through its 'shifted' alternate handling:

icuSetCollate(locale="en_US", alternate_handling="shifted")
sort(c("L.Y", "Lu", "L.Q"))

See ?icuSetCollate and the references there and in ?Comparison.
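
To make the two orderings in this thread concrete, here is a minimal sketch that reproduces both without touching the session locale. It is only a rough emulation, not the algorithm ICU or the C library uses: real collation compares at several levels, whereas this strips punctuation and folds case before comparing. The names bytewise, keys and ignoring_punct are made up for illustration, and it assumes a reasonably recent R in which sort() and order() accept method = "radix". The icuSetCollate() call above additionally assumes an R build with ICU support, which capabilities("ICU") reports.

```r
x <- c("L.Y", "Lu", "L.Q")

# method = "radix" collates character vectors as in the C locale,
# so "." (0x2E) sorts before every letter regardless of LC_COLLATE.
bytewise <- sort(x, method = "radix")

# Hypothetical emulation of "punctuation ignored": build keys with
# punctuation removed and case folded, then order the originals by those keys.
keys <- tolower(gsub("[[:punct:]]", "", x))   # "ly" "lu" "lq"
ignoring_punct <- x[order(keys, method = "radix")]

print(bytewise)        # the ordering reported for LC_COLLATE = "C"
print(ignoring_punct)  # the ordering reported for en_US.UTF-8
```

Seen this way the second result is consistent: the strings are compared as "lq" < "lu" < "ly", with the "." playing no part in the primary comparison.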


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


