[Rd] bug in rank(), order(), is.unsorted() on character vector
hpages at fhcrc.org
Thu Dec 8 10:57:02 CET 2011
On 11-12-07 10:29 AM, Roebuck,Paul L wrote:
> Do this first and try again.
> R> Sys.setlocale("LC_COLLATE", "C")
OK I see it now (in ?Sys.setlocale):
Sys.setlocale("LC_COLLATE", "C") # turn off locale-specific sorting,
Thanks all for the answers!
I never really realized how counter-intuitive some collating sequences
can be. For example, the fact that LC_COLLATE=en_CA.UTF-8 doesn't
preserve the order of the strings when a common suffix is added to
them is scary. Also, LC_COLLATE=en_CA.UTF-8 cannot simply be ignoring
the '_' (underscores) and the '.' (dots): that can only be a first
pass, after which it needs to break ties in a way that defines a
total order. So it looks like the exact definition of this collating
sequence is counter-intuitive and complicated.
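
A minimal sketch of the comparison, using the three strings from the original report. It ranks them under the C collating sequence with and without a common suffix; the variable names are illustrative only, and the en_CA.UTF-8 side is not reproduced here because that locale may not be installed on every system.

```r
# Strings from the original report, plus the same strings with a common suffix.
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# Switch to the C collating sequence, remembering the current setting.
old_collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")

rank_x  <- rank(x)    # ranks under byte-wise (C) collation
rank_xa <- rank(xa)

# Restore the previous collating sequence.
Sys.setlocale("LC_COLLATE", old_collate)

# For these strings the common suffix leaves the ranks unchanged under C.
identical(rank_x, rank_xa)
```

Note that this only shows the behaviour for these particular strings. Even under C, appending a suffix can reorder strings when one is a prefix of another (e.g. "a" and "ab" with the suffix "c").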
Maybe that's just how things are, and developers who want
portability and reproducibility of their code are already putting
a Sys.setlocale("LC_COLLATE", "C") statement somewhere in their package
to force all their users onto the same collating sequence.
It sounds a little bit drastic though and it might introduce some
conflicts with other packages.
So maybe a better approach is to alter LC_COLLATE only temporarily,
inside the functions where it matters, i.e. where the returned value
actually depends on the collating sequence? If I don't do this, then
there is no way I can write a test for my function, because the
test would work for me but fail for someone else.
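
One way this could look, as a sketch only: the helper below (a hypothetical name, not taken from any package) switches LC_COLLATE to C for the duration of the call and uses on.exit() to restore the caller's setting, even if sort() fails.

```r
# Hypothetical helper: sort a character vector under the C collating
# sequence without permanently changing the caller's locale.
sort_in_c_locale <- function(x) {
  old_collate <- Sys.getlocale("LC_COLLATE")
  # Restore the previous collating sequence when the function exits,
  # whether normally or through an error.
  on.exit(Sys.setlocale("LC_COLLATE", old_collate), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

# Example input reusing the strings from this thread.
seq_names    <- c("_1_a", "1_9a", "2_9a")
sorted_names <- sort_in_c_locale(seq_names)
```

Because the result no longer depends on the session's LC_COLLATE, a test can compare it against a fixed expected vector.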
Actually this is the situation I was facing when I did my first post:
I have a function that downloads a list of sequences from the Ensembl
FTP server, sorts them by name, and returns them to the user. I have
a test for that function, and the test was working for me when I was
running it directly in my R session but it was failing when I was
doing 'R CMD check'. It seems that the latter alters LC_COLLATE before
running the tests (maybe to LC_COLLATE=C) but not the former. I fixed
this by enforcing LC_COLLATE=C inside my function.
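
An alternative sketch that avoids touching the locale at all. It assumes a version of R newer than the 2.14.0 used in this thread: the "radix" method of sort() and order() was added later (in R 3.3.0, an assumption worth checking against ?sort), and it is documented as always collating in the C locale.

```r
# Assumes an R version that provides method = "radix" for sort()/order().
xa <- c("_1_a", "1_9a", "2_9a")

# Radix sorting collates in the C locale regardless of LC_COLLATE.
sorted_xa <- sort(xa, method = "radix")
order_xa  <- order(xa, method = "radix")
```

This leaves the session locale untouched, at the cost of requiring a recent enough R.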
A naive question: wouldn't everything be simpler if LC_COLLATE=C
was the default for everybody?
> On 12/7/11 3:41 AM, "Hervé Pagès"<hpages at fhcrc.org> wrote:
>> This looks OK:
>>> x<- c("_1_", "1_9", "2_9")
>>  1 2 3
>> But this does not:
>>> xa<- paste(x, "a", sep="")
>>  "_1_a" "1_9a" "2_9a"
>>  2 1 3
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>  LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>>  LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
>>  LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
>>  LC_PAPER=C LC_NAME=C
>>  LC_ADDRESS=C LC_TELEPHONE=C
>>  LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>> attached base packages:
>>  stats graphics grDevices utils datasets methods base
>> loaded via a namespace (and not attached):
>>  tools_2.14.0
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319