[R] a question of alphabetical order

Wed Apr 16 13:13:38 CEST 2008

Thanks Hans!

Hans-Joerg Bibiko wrote:
> Hi,
>
> as already mentioned, sorting could be a pain.
>
> My solution to that is to write my own "order" routine for a given 
> language.
> The idea is to transform the UTF-8 string into ASCII in such a way 
> that the built-in order routine outputs the desired result. But this 
> can be a rather rocky road.
>
> Example for Spanish (please correct me if I'm wrong):
> -accents are ignored

Correct.
> -ll is one single entity and comes after l (ludar comes before llave)

No. Nowadays ll is treated as two separate letters (two entities, in 
your terms), so words are simply ordered letter by letter. Here is an 
example:

lama, lazo, leve, llave, lluvia, ludar (if such a word exists), luna
> -ch is one single entity and comes after c

No, the same applies to ch. Here is another example:

capa, casco, chapa, chepa, cisma, copa, curva

Here is the original source (in Spanish):

http://tinyurl.com/ysm243
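
As a quick check of the two examples above, here is a minimal sketch in 
Python (the language Hans uses for the normalization step). It assumes 
the words are lower-case and unaccented, so that a plain 
character-by-character sort stands in for the modern rule that treats ll 
and ch as two letters each. The name modern_order is my own and not 
taken from any library.

```python
def modern_order(words):
    # Plain code-point comparison: "ll" and "ch" are ordinary two-letter
    # sequences here, which matches the modern rule for lower-case,
    # unaccented words.
    return sorted(words)


print(modern_order(["luna", "llave", "lazo", "ludar", "lama", "lluvia", "leve"]))
print(modern_order(["curva", "chepa", "copa", "capa", "cisma", "chapa", "casco"]))
```

Both calls should give back the two lists in the order I wrote them 
above.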

>
> The only thing I do not know is whether a 'll' can ever be two 
> entities rather than one (maybe where two nouns are combined). If 
> so, the whole story becomes much more complicated.

I think this is the case for ll, ch and even rr, although rr was 
never considered a "single entity".

What I don't know is how the Spanish locales implement these rules. I'll 
try to find out and keep this thread (or a follow-up on the r-sig-mac 
list) updated.
>
> Now the big question is how to delete all these accents in åàÿñü etc. 
> to get aaynu (technically speaking, a compatibility decomposition of 
> the Unicode string, NFKD, followed by removal of the combining marks).
> One possible way is to use a scripting language which can handle it. 
> The only language I know that can do this by default is Python. For 
> Ruby or Perl one has to install an additional library.
>
> On a Mac, Python is installed by default; on Windows it is not. If 
> this ordering is also an issue for Windows users, they have to 
> install it beforehand.
>
> The code comes here:
>
> orderES <- function(x) {
>     #decomposes all accented characters
>     str <- NKFD(x)
>
>     #all combining diacritics
>     nonChars <- c(768:879)
>     pattern <- paste("[", intToUtf8(as.integer(nonChars)), "]", sep="")
>
>     #delete all combining diacritics
>     str <- gsub(pattern, "", str)
>
>     #transform ll and ch to l{ and c{ ({ comes after z)
>     str <- gsub("ll", "l{", gsub("ch", "c{", str))
>     order(str)
> }
>
> NKFD <- function(x) {
>     # Python 2 script: NFKD-normalize each input string, one per line
>     py <- paste("# coding=utf-8\nimport unicodedata\n",
>                 "for i,v in enumerate([\"", paste(x, collapse="\", \""), "\"]):",
>                 "print unicodedata.normalize(\"NFKD\",unicode(v, \"UTF-8\")).encode(\"UTF-8\")",
>                 sep="")
>     system(paste("echo -en '", py, "'|python -", sep=""), intern=TRUE)
> }
>
> Notes on the NFKD routine:
> - it only works if R's environment is set to UTF-8!
> - a Danish ø, for instance, won't be decomposed to o / (such cases have 
> to be handled manually)
> - the routine is not very fast
>
>
> Cheers,
>
> --Hans
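To make the steps in the quoted routine easier to follow, here is a 
minimal sketch of the same idea written entirely in Python 3 (the quoted 
script targets Python 2). It only uses unicodedata.normalize and 
unicodedata.combining from the standard library. The names strip_accents 
and order_es and the traditional flag are my own, hypothetical choices; 
the flag reproduces Hans's l{ / c{ substitution, while the default 
follows the modern rule.

```python
import unicodedata


def strip_accents(s):
    # Decompose (NFKD), then drop every combining mark.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def order_es(words, traditional=False):
    # Return the indices that sort `words` (0-based, unlike R's order()).
    def key(w):
        k = strip_accents(w.lower())
        if traditional:
            # Old rule, as in the quoted code: "{" sorts after "z" in ASCII.
            k = k.replace("ch", "c{").replace("ll", "l{")
        return k

    return sorted(range(len(words)), key=lambda i: key(words[i]))


words = ["luna", "llave", "línea", "lama"]
print([words[i] for i in order_es(words)])
print([words[i] for i in order_es(words, traditional=True)])
print(strip_accents("åàÿñü"), strip_accents("ø"))
```

With the default, llave sorts before luna; with traditional=True it 
sorts after it. The last line shows the limitation Hans mentions: ø has 
no decomposition and comes back unchanged. Like the quoted routine, this 
sketch also folds ñ into n.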

I don't know whether this applies only to the Mac or is a general issue. 
In any case, as I am working on a Mac now, I will move the discussion to 
the r-sig-mac list, as proposed by Brian Ripley. Do you agree?

See you there!

Greetings,

Ricardo

-- 
Ricardo Rodríguez
Your XEN ICT Team