[Rd] bug in rank(), order(), is.unsorted() on character vector

peter dalgaard pdalgd at gmail.com
Wed Dec 7 19:30:10 CET 2011


On Dec 7, 2011, at 15:48 , Joris Meys wrote:

> @Barry : regardless of whether '_' comes before or after '1' , it
> should be consistent. Adding an 'a' shouldn't shift '_' from before
> '1' to between '1' and '2', that's clearly an error. The help files
> are not stating anything about that. The only thing I can imagine, is
> that '_' gets ignored (in that case 19a would rank before 1a).

As far as I remember, that is exactly the case. In some locales, and not even consistently across different OS versions of the "same" locale, there are characters that are ignored for collation. With that in mind, what we see is really not any stranger than "a" < "ab" but "ac" > "abc". 

R just uses what the OS supplies, so if you want to use words like "inconsistent" or "error", please direct them at those who define the locales. (And be prepared to realize that you may have kicked a hornet's nest...)

> 
> This said, I can't reproduce.
> 
>> x <- c("_1_", "1_9", "2_9")
>> xa <- paste(x,'a',sep='')
>> rank(x)
> [1] 1 2 3
>> rank(xa)
> [1] 1 2 3
> 
>> sessionInfo()
> R version 2.14.0 Patched (2006-00-00 r00000)
> Platform: i386-pc-mingw32/i386 (32-bit)
> 
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
> States.1252    LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C                           LC_TIME=English_United
> States.1252
> 
> attached base packages:
> [1] grDevices datasets  splines   graphics  stats     tcltk     utils
>   methods   base
> 
> other attached packages:
> [1] svSocket_0.9-51 TinnR_1.0.3     R2HTML_2.2      Hmisc_3.8-3
> survival_2.36-9
> 
> loaded via a namespace (and not attached):
> [1] cluster_1.14.1  grid_2.14.0     lattice_0.19-33 svMisc_0.9-63
> tools_2.14.0
> 
> 
> 2011/12/7 Hervé Pagès <hpages at fhcrc.org>:
>> Hi,
>> 
>> This looks OK:
>> 
>>> x <- c("_1_", "1_9", "2_9")
>>> rank(x)
>> [1] 1 2 3
>> 
>> But this does not:
>> 
>>> xa <- paste(x, "a", sep="")
>>> xa
>> [1] "_1_a" "1_9a" "2_9a"
>>> rank(xa)
>> [1] 2 1 3
>> 
>> Cheers,
>> H.
>> 
>>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> loaded via a namespace (and not attached):
>> [1] tools_2.14.0
>> 
>> 
>> --
>> Hervé Pagès
>> 
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>> 
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>> 
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
> 
> -- 
> Joris Meys
> Statistical consultant
> 
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
> 
> tel : +32 9 264 59 87
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com



More information about the R-devel mailing list