[R] difference in sort order linux/Windows (R.2.11.0)

Duncan Murdoch murdoch.duncan at gmail.com
Fri May 28 16:37:39 CEST 2010


On 28/05/2010 9:24 AM, (Ted Harding) wrote:
> An experiment:
>
>   sort(c("AACD","A CD"))
>   #  [1] "AACD" "A CD"
>
>   sort(c("ABCD","A CD"))
>   #  [1] "ABCD" "A CD"
>
>   sort(c("ACCD","A CD"))
>   #  [1] "ACCD" "A CD"
>
>   sort(c("ADCD","A CD"))
>   #  [1] "A CD" "ADCD"
>
>   sort(c("AECD","A CD"))
>   #  [1] "A CD" "AECD"
>   ## (with results for "AFCD", ... "AZCD" similar to the last two).
>
>   LC_COLLATE=en_GB.UTF-8
>
> (R version 2.11.0 (2010-04-22) on Linux).
>
> So this behaves, in en_GB.UTF-8, as though " " (SPACE) is between
> "C" and "D".
>
> This is nuts!!!
>
> Curable if I set (e.g.) LC_COLLATE="C" on startup. But what else
> might break if I do so?
>   

You have to realize that to a large extent this is not under our 
control.  Your system will have linked to some library (outside of R) to 
do string collation, and the problem lies in that library.  You should 
determine which system library is handling your collations.
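
Since collation is delegated to the platform, one way to get the same
order on Linux and Windows is to switch only LC_COLLATE to "C" around
the sort and then restore it, rather than changing the locale for the
whole session.  A minimal sketch follows; the helper name sort_c is
hypothetical, not part of R.  The quoted results look consistent with a
collation that ignores the space on its first pass (so "A CD" compares
like "ACD"), although that is an inference from the output, not
something confirmed for this particular build.

```r
# Hypothetical helper: sort a character vector in the C locale
# (plain byte order) and restore the previous collation afterwards.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the caller's collation setting even if sort() fails.
  on.exit(Sys.setlocale("LC_COLLATE", old))
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

# In byte order, SPACE (0x20) comes before every letter.
sort_c(c("ADCD", "A CD", "AACD"))
# [1] "A CD" "AACD" "ADCD"
```

Only LC_COLLATE is touched, so number formatting, messages and
character-type handling keep the session's usual locale.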

I'd like to tell you how to do that, but I don't know for your build.  
You can find out if you're using the recommended ICU library by running 
example(icuSetCollate); that gives a number of warnings like

In icuSetCollate(locale = "da_DK", case_first = "default") :
  ICU is not supported on this build

on Windows.  If you don't see those warnings, your build is using ICU, 
and you want to talk to the ICU people.  If you do see them, then you'll 
need to look deeper to find out what you're actually using.
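
The check described above can also be sketched without running the
example.  This assumes a later R than the 2.11.0 discussed here: as far
as I know, capabilities("ICU") only exists in more recent versions, so
the code treats a missing entry as "no ICU".

```r
# Which locale governs string collation in this session?
collate_locale <- Sys.getlocale("LC_COLLATE")

# Assumption: capabilities() knows about "ICU" in this version of R.
# On versions that do not, the result is empty and has_icu is FALSE.
has_icu <- isTRUE(unname(capabilities("ICU")))

cat("LC_COLLATE:", collate_locale, "\n")
cat("Built with ICU:", has_icu, "\n")
```

If has_icu is FALSE, collation is being done by something else
(typically the C library's locale support), and that is where to look.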

Duncan Murdoch
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 28-May-10                                       Time: 14:24:08
> ------------------------------ XFMail ------------------------------