[R] difference in sort order linux/Windows (R.2.11.0)

(Ted Harding) Ted.Harding at manchester.ac.uk
Fri May 28 22:14:59 CEST 2010


On 28-May-10 14:37:39, Duncan Murdoch wrote:
> On 28/05/2010 9:24 AM, (Ted Harding) wrote:
>> An experiment:
>>
>>   sort(c("AACD","A CD"))
>>   #  [1] "AACD" "A CD"
>>
>>   sort(c("ABCD","A CD"))
>>   #  [1] "ABCD" "A CD"
>>
>>   sort(c("ACCD","A CD"))
>>   #  [1] "ACCD" "A CD"
>>
>>   sort(c("ADCD","A CD"))
>>   #  [1] "A CD" "ADCD"
>>
>>   sort(c("AECD","A CD"))
>>   #  [1] "A CD" "AECD"
>>   ## (with results for "AFCD", ... "AZCD" similar to the last two).
>>
>>   LC_COLLATE=en_GB.UTF-8
>>
>> (R version 2.11.0 (2010-04-22) on Linux).
>>
>> So this behaves, in en_GB.UTF-8, as though " " (SPACE) is between
>> "C" and "D".
>>
>> This is nuts!!!
>>
>> Curable if I set (e.g.) LC_COLLATE="C" on startup. But what else
>> might break if I do so?
>>   
> 
> You have to realize that to a large extent this is not under our 
> control. Your system will have linked to some library (outside of R)
> to do string collation, and the problem lies in that library. You
> should determine which system library is handling your collations.
> 
> I'd like to tell you how to do that, but I don't know for your build.  
> You can find out if you're using the recommended ICU library by
> running example(icuSetCollate); that gives a number of warnings like
> 
> In icuSetCollate(locale = "da_DK", case_first = "default") :
>   ICU is not supported on this build
> 
> in Windows.  If you don't see those, then you want to talk to the ICU 
> people.  If you do, then you'll need to look deeper to find out what 
> you're actually using.
> 
> Duncan Murdoch

Thanks for the further guidance, Duncan. I indeed get 4 such warnings
from example(icuSetCollate), indicating that ICU is not being used.

I have now thrown the above experiment straight at Linux, entering
the following shell commands (with the results shown on the lines
starting with "#"):

sort << EOT
"AACD"
"A CD"
EOT
# "AACD"
# "A CD"

sort << EOT
"ABCD"
"A CD"
EOT
# "ABCD"
# "A CD"

sort << EOT
"ACCD"
"A CD"
EOT
# "ACCD"
# "A CD"

sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"

This shows that the system's own collating order for this locale,
independently of R, treats " " (SPACE) as coming between "C" and "D",
just as when I tried it in R.
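
For comparison, here is a minimal sketch of the same kind of check with
collation forced to the C locale for one command only, leaving the rest
of the session's locale alone. The helper name sort_c is made up for
illustration and was not part of the runs above. In the C locale,
strings are compared by byte value, so SPACE (0x20) sorts before every
letter:

```shell
# Hypothetical helper: run sort with C (byte-order) collation,
# without changing the locale of the surrounding shell.
sort_c() { LC_ALL=C sort "$@"; }

sort_c << EOT
"AACD"
"A CD"
"ADCD"
EOT
# "A CD"
# "AACD"
# "ADCD"
```

Setting the variable as a prefix like this affects only that one
invocation of sort, so nothing else in the session is touched.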

I am now spamming my Linux contacts about it!

The result of the "locale" command in Linux includes:
  LC_COLLATE="en_GB.UTF-8"
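
As a rough sketch of where that value comes from: as far as I know,
the POSIX precedence for collation is LC_ALL first, then LC_COLLATE,
then LANG. The function name effective_collate is invented for
illustration, and the final fallback to "C" is an assumption (POSIX
leaves the default to the implementation):

```shell
# Hypothetical helper: print the locale name that should govern
# collation, following the precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: nothing is set, so fall back to "C".
        printf '%s\n' "C"
    fi
}

effective_collate
```

If LC_ALL is set it overrides LC_COLLATE, which is worth checking
before concluding that a change to LC_COLLATE has had no effect.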

This happens consistently on a Debian Lenny and a Debian Etch system.

Thanks,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10                                       Time: 21:14:54
------------------------------ XFMail ------------------------------


