[R] sort()ing strings

Martin Maechler maechler at stat.math.ethz.ch
Thu Oct 13 13:00:29 CEST 2016


>>>>> Bob O'Hara <rni.boh at gmail.com>
>>>>>     on Thu, 13 Oct 2016 11:55:04 +0200 writes:

    > Yes, thanks. That seems to be it:
    > > thing <- c("M1", "M2", "M.1", "M.2")
    > > sort(thing)
    > [1] "M1" "M.1" "M2" "M.2"

which I do find strange, indeed, given your sessionInfo which
contains
	LC_COLLATE=en_US.UTF-8
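
For reference, the order Bob reports is what Petr's guess (quoted further
down) predicts: ignore the dot, then sort alphabetically. A minimal sketch
of that idea, not of what the C library or ICU actually does internally.
The helper name sort_ignoring_dots() is made up for illustration,
folding case is an extra assumption (it fits "plothpd" landing between
"PlotEff" and "ProvinceNames" in the ls() output), and method = "radix"
needs R >= 3.3.0:

```r
# Hypothetical helper: approximate a collation that ignores dots (and case)
# by sorting on a stripped-down key instead of on the strings themselves.
sort_ignoring_dots <- function(x) {
  key <- tolower(gsub(".", "", x, fixed = TRUE))
  # method = "radix" compares the keys as in the C locale, independent of
  # the session's LC_COLLATE, and leaves tied keys in their input order.
  x[order(key, method = "radix")]
}

sort_ignoring_dots(c("M1", "M2", "M.1", "M.2"))
# order: "M1", "M.1", "M2", "M.2"

sort_ignoring_dots(c("ModisPrecip.sd", "Modis.hpd", "ModisFixed.hpd",
                     "ModisFixed.df", "Modis.df"))
# order: "Modis.df", "ModisFixed.df", "ModisFixed.hpd", "Modis.hpd",
#        "ModisPrecip.sd"
```

Ties such as "M1" vs "M.1" simply keep their input order here; a real
collator breaks them at a later comparison level, so this is only an
approximation.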

    > The only documentation I can find is from ?Comparison:
    > "Collation of non-letters (spaces, punctuation signs,
    > hyphens, fractions and so on) is even more problematic."

Well.  That help page contains more information further down,
notably about ICU.  If

     capabilities("ICU")

gives TRUE for you (I assume it will, as you use a modern Ubuntu version),
you can tweak the behavior to be more "reasonable" than above
via the R functions icuSetCollate() and icuGetCollate().
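
A minimal sketch of such a tweak, assuming ICU is available and actually
used for collation in the session (it is not when LC_COLLATE is "C" or
"POSIX"). The locale "en_US" and the setting
alternate_shifted = "non_ignorable" are picked here for illustration
only; ?icuSetCollate lists the other attributes. No output is shown
because it depends on the platform and the ICU version:

```r
x <- c("M1", "M2", "M.1", "M.2")

if (capabilities("ICU")) {
  # Which ICU collator, if any, is R using right now?
  print(icuGetCollate())

  # Ask ICU to treat punctuation such as "." as ordinary characters
  # instead of ignoring it at the first comparison levels.
  icuSetCollate(locale = "en_US", alternate_shifted = "non_ignorable")
  res <- sort(x)
  print(res)

  # Go back to the collator implied by the session's locale settings.
  icuSetCollate(locale = "default")
} else {
  # Without ICU, sort() falls back to the C library's collation.
  res <- sort(x)
  print(res)
}
```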

BTW, I say "strange" above because for me, also on a modern
Linux (Fedora 24), I "always" see

> sort( c("M1", "M2", "M.1", "M.2") )
[1] "M.1" "M.2" "M1"  "M2" 


> Sys.setlocale("LC_COLLATE", "de_CH.UTF-8")
[1] "de_CH.UTF-8"
> sort( c("M1", "M2", "M.1", "M.2") )
[1] "M.1" "M.2" "M1"  "M2" 
> Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
[1] "en_US.UTF-8"
> sort( c("M1", "M2", "M.1", "M.2") )
[1] "M.1" "M.2" "M1"  "M2" 

> Sys.setlocale("LC_COLLATE", "C") # <--> ASCII (the ole' time default)
[1] "C"
> sort( c("M1", "M2", "M.1", "M.2") )
[1] "M.1" "M.2" "M1"  "M2" 
> 

I do use a newer R version (3.3.1 patched)
but would not have expected that to matter here.
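
If the aim is simply to get the "C" ordering shown last, whatever the
session locale happens to be, one option is to switch LC_COLLATE only for
the duration of the sort. A minimal sketch; the helper name sort_c() is
made up for illustration:

```r
# Hypothetical helper: sort under LC_COLLATE = "C", then restore the
# collation locale that was in effect before the call.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

sort_c(c("M1", "M2", "M.1", "M.2"))
# order: "M.1", "M.2", "M1", "M2"
```

In the "C" locale strings are compared byte by byte, and "." (0x2E)
comes before the digits (0x30 to 0x39), hence the M.* names first.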

Martin


    > Indeed.

    > Bob

    > On 13 October 2016 at 11:26, PIKAL Petr
    > <petr.pikal at precheza.cz> wrote:
    >> Hi
    >> 
    >> Just a wild guess. Dot is ignored and the output is
    >> alphabetically sorted.
    >> 
    >> You could try sorting it yourself with
    >> 
    >> sort(ls())
    >> 
    >> Cheers Petr
    >> 
    >>> -----Original Message-----
    >>> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Bob O'Hara
    >>> Sent: Thursday, October 13, 2016 10:29 AM
    >>> To: r-help <r-help at r-project.org>
    >>> Subject: [R] (no subject)
    >>> 
    >>> I've just come across an odd problem with sorting in
    >>> ls(): it doesn't seem to order the object names
    >>> correctly. If I do the following, the order isn't what I
    >>> expect:
    >>> 
    >>> > ls(sorted=TRUE)
    >>>  [1] "AridData" "AridDataToBUGS" "Arid.df" "Arid.hpd" "AridPrecip.sd" "Break.df"
    >>>  [7] "Break.hpd" "Cols" "Data" "DataFrames" "DataToBUGS" "DataToBUGS.nonlog"
    >>> [13] "FitBRugs" "Fixed.df" "Fixed.hpd" "FormatData" "GetCol" "GetHPD"
    >>> [19] "GetMCMC" "GetRow" "HPDIs" "Int.alpha12" "Int.alpha21" "ModisData"
    >>> [25] "ModisDataToBUGS" "Modis.df" "ModisFixed.df" "ModisFixed.hpd" "Modis.hpd" "ModisPrecip.sd"
    >>> [31] "ModisShrink.df" "ModisShrink.hpd" "ModisYears" "OrigData" "OrigDataToBUGS" "Orig.df"
    >>> [37] "Orig.hpd" "OrigPrecip.sd" "OrigYears" "PlotChecks" "PlotEff" "plothpd"
    >>> [43] "ProvinceNames" "ResNames" "ResNamesOrder" "Shrink.df" "Shrink.hpd" "SimInits"
    >>> 
    >>> Specifically, the Modis* objects are sorted like this:
    >>> 
    >>> > ls(sorted=TRUE)[26:30]
    >>> [1] "Modis.df" "ModisFixed.df" "ModisFixed.hpd" "Modis.hpd" "ModisPrecip.sd"
    >>> 
    >>> With Modis.* coming both before and after ModisF*. I
    >>> can't see why there would be any odd problems with
    >>> character sets changing (this was all done on a single
    >>> computer with no weird locale switching), and the
    >>> objects are all created within a single R session:
    >>> 
    >>> > sessionInfo()
    >>> R version 3.2.5 (2016-04-14)
    >>> Platform: x86_64-pc-linux-gnu (64-bit)
    >>> Running under: Ubuntu 16.04.1 LTS
    >>> 
    >>> locale:
    >>>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8
    >>>  [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C
    >>>  [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
    >>> 
    >>> attached base packages:
    >>> [1] stats graphics grDevices utils datasets methods base
    >>> 
    >>> other attached packages:
    >>> [1] MCMCglmm_2.22.1 ape_3.5 Matrix_1.2-7.1 RColorBrewer_1.1-2 plyr_1.8.4 coda_0.18-1
    >>> 
    >>> loaded via a namespace (and not attached):
    >>> [1] cubature_1.1-2 corpcor_1.6.8 tools_3.2.5 Rcpp_0.12.7 nlme_3.1-128 grid_3.2.5 knitr_1.14
    >>> [8] tensorA_0.36 lattice_0.20-34
    >>> 
    >>> Can anyone explain what's going on?
    >>> 
    >>> Bob
    >>> --
    >>> Bob O'Hara
    >>> 
    >>> Biodiversity and Climate Research Centre
    >>> Senckenberganlage 25
    >>> D-60325 Frankfurt am Main, Germany
    >>> 
    >>> Tel: +49 69 798 40226
    >>> Mobile: +49 1515 888 5440
    >>> WWW: http://www.bik-f.de/root/index.php?page_id=219
    >>> Blog: http://occamstypewriter.org/boboh/
    >>> Journal of Negative Results - EEB: www.jnr-eeb.org


