[R] a question of alphabetical order

[Ricardo Rodriguez] Your XEN ICT Team webmaster at xen.net
Tue Apr 15 23:31:00 CEST 2008


Tricky question, this order issue :-(

Thank you so much for the detailed explanation.

So must I conclude that I will have to live with this ASCII order 
while working on Mac OS X 10.5.2 until Apple fixes this bug?

You mentioned es_ES.ISO8859-15 on the Mac. Will it do the trick? Yes, as 
far as I understand. But as I am using R.app, the locale is set by the 
system preferences. Truly, I am rather confused by this issue.

Could I force es_ES.ISO8859-15 as the locale on the Mac?
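
To make the question concrete, here is a minimal sketch of what I imagine 
forcing the collation locale from inside R would look like. It assumes 
that Sys.setlocale() accepts the name es_ES.ISO8859-15 on this platform, 
which I have not confirmed; the vector `words` is only an illustration.

```r
# Sketch only: switch the collation category, sort, then restore.
# Assumption: the locale name "es_ES.ISO8859-15" is installed here.
old <- Sys.getlocale("LC_COLLATE")
new <- suppressWarnings(Sys.setlocale("LC_COLLATE", "es_ES.ISO8859-15"))

if (identical(new, "")) {
  # Sys.setlocale() returns "" when the request cannot be honoured
  message("Locale not available here; collation is unchanged.")
} else {
  words <- c("b", "a", "B", "A")   # hypothetical test data
  print(sort(words))
}

# Put the previous collation setting back
Sys.setlocale("LC_COLLATE", old)
```

Only LC_COLLATE is changed, so the rest of the session (encoding, number 
formats) should stay as set by the system preferences.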

Sorry if I add another question here... why does Excel order the list 
correctly? I guess it doesn't rely on the Mac settings.

As an R newbie I must admit that this behaviour, and others like it, is 
really hard to deal with. But I've seen, and even done, so many 
wonderful things with R that it is worth all the effort. Thanks for your help.

All the best,

Ricardo


Prof Brian Ripley wrote:
> This is a known Mac OS X bug, nothing to do with R which uses the 
> system functions (strcoll/wcscoll) for such things.
>
> If you look at the help for sort, it refers you to ?Comparison.  Which 
> says
>
>      Comparison of strings in character vectors is lexicographic within
>      the strings using the collating sequence of the locale in use: see
>      'locales'.  The collating sequence of locales such as 'en_US' is
>      normally different from 'C' (which should use ASCII) and can be
>      surprising.  Beware of making _any_ assumptions about the
>      collation order: e.g. in Estonian 'Z' comes between 'S' and 'T',
>      and collation is not necessarily character-by-character - in
>      Danish 'aa' sorts as a single letter, after 'z'.  Some platforms
>      may not respect the locale and always sort in ASCII.  (String
>      comparison is always for the part of the string up to the first
>      nul if there are embedded nuls.)
>
> Mac OS X (more specifically, 10.5.2 on i386) is one of those 
> disrespectful platforms.
>
>> x <- intToUtf8(c(32:127, 160:255), multiple=T)
>> order(x)
>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
>  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
>  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
>  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
>  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
>  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
> [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
> [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
> [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
> [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
> [181] 181 182 183 184 185 186 187 188 189 190 191 192
>
> which is quite different from Linux or Solaris.  This may not come 
> out, but paste(sort(x), collapse="") includes
>
> aAªáÁàÀâÂåÅäÄãÃæÆbBcCçÇdDeEéÉèÈêÊëË
>
> on Linux in es_ES.utf8 .
>
> Platforms are a lot worse at sorting in UTF-8 than 8-bit encodings.  
> Mac OS X has es_ES.ISO8859-15, and that does do a reasonable job 
> including aáàâåäãæ .
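
A small check along the lines of the quoted explanation, as a rough 
sketch: it compares sort() in the current locale with plain code-point 
order for a few ASCII letters. The names `x`, `locale_order` and 
`codepoint_order` are mine, and the test is only indicative, since the 
two orders also agree when the session is genuinely in the C locale.

```r
# Letters only, so the comparison is about alphabetical order
x <- c("b", "a", "B", "A", "c", "C")

# sort() uses the collating sequence of the current LC_COLLATE
locale_order <- sort(x)

# Order by numeric code point, which is what ASCII sorting amounts to
codepoint_order <- x[order(sapply(x, utf8ToInt))]

if (identical(locale_order, codepoint_order)) {
  message("sort() matches code-point (ASCII) order in locale: ",
          Sys.getlocale("LC_COLLATE"))
} else {
  message("sort() follows locale-specific collation rules")
}
```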


-- 
Ricardo Rodríguez
Your XEN ICT Team


