[Rd] Bug in rank with utf8?

peter dalgaard pdalgd at gmail.com
Thu Aug 13 16:19:15 CEST 2015


Yes, collation is a strange thing, and? 

Collation order will depend on locale settings, and there are quite a few cases where the collation order of two items is not defined. 
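
A minimal sketch of that locale dependence, assuming a session in which Sys.setlocale() is allowed to change LC_COLLATE. The helper rank_in_locale() is a hypothetical name made up for illustration, not part of base R; it ranks a vector under a chosen collation locale and then restores the previous setting.

```r
# Hypothetical helper (not from base R or this thread): evaluate rank()
# under a given LC_COLLATE setting, then restore the previous setting.
rank_in_locale <- function(v, locale) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  rank(v)
}

x <- "\u0663"  # ARABIC-INDIC DIGIT THREE
y <- "3"       # ASCII digit three

# Under the "C" locale, collation does not consult language-specific rules.
print(rank_in_locale(c(x, y), "C"))

# Under the session's own locale the result may differ between platforms,
# and may even be reported as a tie.
print(rank(c(x, y)))
```

Comparing the two printed results on different systems should show whether the platform's collation treats the two digits as distinct, tied, or ordered the other way round.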

To add to the confusion, on OS X Mavericks, I see

> x <- "\u0663"
> y <- 3
> 
> x == y
[1] FALSE
> rank(c(x, y))
[1] 2 1
> x
[1] "٣"
> x == y
[1] FALSE
> x > y
[1] TRUE
> x < y
[1] FALSE

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"

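Notice the differences from en_US.UTF8 (sans hyphen) on your system: there rank() reports a tie (1.5 1.5) and both x > y and x < y are FALSE, whereas here x sorts strictly after y.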

-pd

On 13 Aug 2015, at 16:01 , John McKown <john.archie.mckown at gmail.com> wrote:

> 2015-08-13 8:39 GMT-05:00 Hadley Wickham <h.wickham at gmail.com>:
> 
>> x <- "\u0663"
>> y <- 3
>> 
>> x == y
>> # FALSE
>> rank(c(x, y))
>> # c(1.5, 1.5)
>> 
> 
> also interesting, and confusing to me:
> 
>> x == y
> [1] FALSE
>> x > y
> [1] FALSE
>> x < y
> [1] FALSE
>> 
> 
> With some slight changes:
> 
>> x <- "\u0663"
>> y <- "3"
>> xy <- c(x,y)
>> rank(xy);
> [1] 1.5 1.5
>> Sys.getlocale();
> [1]
> "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
>> Sys.setlocale(category="LC_COLLATE", locale="C");
> [1] "C"
>> rank(xy);
> [1] 2 1
>> 
> 
> 
> 
>> --
>> http://had.co.nz/
>> 
>> 
> -- 
> 
> Schrodinger's backup: The condition of any backup is unknown until a
> restore is attempted.
> 
> Yoda of Borg, we are. Futile, resistance is, yes. Assimilated, you will be.
> 
> He's about as useful as a wax frying pan.
> 
> 10 to the 12th power microphones = 1 Megaphone
> 
> Maranatha! <><
> John McKown
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


