[R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Feb 2 13:20:59 CET 2011


'Strange' to have no response on this.  Can a knowledgeable Danish 
writer please confirm that this is how the OSes are supposed to handle 
Danish collation?

On Mon, 24 Jan 2011, Prof Brian Ripley wrote:

> On Mon, 24 Jan 2011, Søren Højsgaard wrote:
>
>> Dear list,
>> 
>> Please consider the following call of sort
>> 
>>> sort(c("a","f"))
>> [1] "a" "f"
>>> sort(c("f","a"))
>> [1] "a" "f"
>>> 
>>> sort(c("aa","ff"))
>> [1] "ff" "aa"
>>> sort(c("ff","aa"))
>> [1] "ff" "aa"
>> The last two results look strange to me. Is that a bug???
>
> It seems that you and your OS disagree about Danish, and I'm in no position 
> to know which is correct.  But this is not an R issue: the sorting is done by 
> OS services.
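
To check which collation locale R has passed on to the OS, something like the following should do. It is only a sketch: the variable name `collate` is illustrative, and the string printed will differ from machine to machine.

```r
# Query the collation category. Comparisons of character strings,
# e.g. in sort() on a character vector, normally follow this category.
collate <- Sys.getlocale("LC_COLLATE")
print(collate)

# The full locale string lists every category at once.
print(Sys.getlocale())
```

If the value shown names a Danish locale, then 'aa' sorting after 'ff' is what the OS collation rules are expected to produce.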
>
>> The result seems to come from calls to order:
>> 
>>> order(c("a","f"))
>> [1] 1 2
>>> order(c("f","a"))
>> [1] 2 1
>>> 
>>> order(c("aa","ff"))
>> [1] 2 1
>>> order(c("ff","aa"))
>> [1] 1 2
>
>> I get the same results on R.2.12.1, R.2.11.1 and R.2.13.0 on Windows 7. 
>> However on Linux, I get the "right answer" (the answer I expected). From 
>> the help pages I get the impression that there might be an issue about 
>> locale, but I didn't understand the details.
>> 
>> Can anyone tell me what is going on here, please?
>
> I recall that 'aa' used to sort at the end of the alphabet in Danish 
> telephone books, so it seems the sort used on Windows thinks so too. See 
> ?Comparison for some further details.  What I don't understand is that 
> someone resident in Denmark finds this strange ....
>
> I get exactly the same in a Danish locale on Mac OS X, for example:
>
>> sort(c("aa","ff"))
> [1] "ff" "aa"
>
> and also on my Linux box (Fedora 14 with LC_COLLATE=da_DK.utf8)
>
>> sort(c("aa","ff"))
> [1] "ff" "aa"
>
> en_DK is not a Danish locale (it is English in Denmark).  If you want an 
> English sort, try an English locale for LC_COLLATE (there may well be 
> several, hence 'an').
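
A minimal sketch of switching the collation locale for one sort and then restoring it. It assumes that the "C" locale is an acceptable stand-in for an English sort of plain ASCII strings, and that "da_DK.utf8" (the name quoted above for Fedora) may or may not exist on a given system. The variable names are illustrative.

```r
x <- c("ff", "aa")

# Remember the current collation locale so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# "C" is available everywhere and compares strings by character code.
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(x)
print(sorted_c)  # "aa" "ff"

# Try a Danish locale. The name is platform-specific and the locale may
# not be installed, in which case Sys.setlocale() returns "" with a warning.
danish <- suppressWarnings(Sys.setlocale("LC_COLLATE", "da_DK.utf8"))
if (nzchar(danish)) {
  print(sort(x))
} else {
  message("Danish locale not available under this name; skipping")
}

# Put the original collation locale back.
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

On Windows the locale names are different (the sessionInfo() output below shows Danish_Denmark.1252), so the name passed to Sys.setlocale() has to be adapted to the platform.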
>
>> 
>> Regards
>> Søren
>> 
>>> sessionInfo()
>> R version 2.12.1 Patched (2010-12-27 r53883)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>> locale:
>> [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252
>> [3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
>> [5] LC_TIME=Danish_Denmark.1252
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> other attached packages:
>> [1] SHDtools_1.0
>> 
>> 
>>> sessionInfo()
>> R version 2.12.1 (2010-12-16)
>> Platform: i686-pc-linux-gnu (32-bit)
>> locale:
>> [1] LC_CTYPE=en_DK.utf8       LC_NUMERIC=C
>> [3] LC_TIME=en_DK.utf8        LC_COLLATE=en_DK.utf8
>> [5] LC_MONETARY=C             LC_MESSAGES=en_DK.utf8
>> [7] LC_PAPER=en_DK.utf8       LC_NAME=C
>> [9] LC_ADDRESS=C              LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_DK.utf8 LC_IDENTIFICATION=C
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
