[R] sorting character vectors

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Aug 19 14:10:32 CEST 2004


The sort order is documented to depend on your locale.  I get

>  sort(x)
[1] " A" " B" " C" "A"  "B"  "C"

in the C locale.  The help page does say so:

     The sort order for character vectors will depend on the collating
     sequence of the locale in use: see 'Comparison'.

The default collation sequences for standard locales in Linux distributions
are quite unintuitive (and are not character-by-character either).  If you
want ASCII ordering, ask for it by setting LC_COLLATE=C.
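
As a minimal sketch of one way to do that from within a session (assuming your platform lets R change the collation category at run time; the variable names `old` and `sorted_c` are only illustrative), you could switch LC_COLLATE to "C", sort, and then restore the previous setting:

```r
# The vector from the original question: three letters, then the same
# letters with a leading space.
x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep = ""))

# Remember the current collation locale so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")

# Ask for plain C (ASCII byte order) collation.
invisible(Sys.setlocale("LC_COLLATE", "C"))

# In the C locale the space (code 32) sorts before any letter.
sorted_c <- sort(x)
print(sorted_c)

# Put the collation locale back as it was.
invisible(Sys.setlocale("LC_COLLATE", old))
```

Setting LC_COLLATE=C in the environment before starting R should have the same effect on collation for the whole session.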


On Thu, 19 Aug 2004 andreas.krause at pharma.novartis.com wrote:

> The following is not what I expected in sorting characters (single letters 
> and the same letters with preceding spaces).
> Can someone enlighten me as to why the following might be a correct result 
> for sorting?
> 
> ; x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep=""))
> ; x
> [1] "A"  "B"  "C"  " A" " B" " C"
> ; sort(x)
> [1] "A"  " A" "B"  " B" "C"  " C"
> ; sort(x, method="shell")
> [1] "A"  " A" "B"  " B" "C"  " C"
> ; sort(x, method="quick")
> [1] "A"  " A" "B"  " B" "C"  " C"
> 
> I would expect the result to be " A" " B" " C" "A"  "B"  "C" instead, 
> going by ASCII codes (and a quick check with S-Plus 6.2 shows that this is 
> what S-Plus thinks the sorted sequence is).

S-Plus explicitly says it uses ASCII.  I believe that is a deficiency they
plan to correct.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



