[R] a question of alphabetical order

Wed Apr 16 10:49:54 CEST 2008

Hi,

as already mentioned, sorting can be a pain.

My solution is to write my own "order" routine for a given
language.
The idea is to transform the UTF-8 string into ASCII in such a way
that the built-in order routine produces the desired result. But this
could turn out to be a rocky road.

Example for Spanish (please correct me if I'm wrong):
-accents are ignored
-ll is one single entity and comes after l (ludar comes before llave)
-ch is one single entity and comes after c

The one thing I do not know is whether a 'll' can ever be two separate
letters rather than one entity (perhaps where two nouns are joined in a
compound). If so, the whole story becomes much more complicated.
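
The ll/ch rules above can be sketched as a sort key. This is only an
illustration in Python 3 (spanish_key is a made-up name), and it assumes
lower-case, already unaccented input and that every 'll' and 'ch' really
is a single entity:

```python
def spanish_key(word):
    # Hypothetical helper: map the digraphs to "l{" and "c{".
    # "{" has a higher code point than "z", and Python compares strings
    # by code point, so "ll" sorts after "lz" and "ch" after "cz".
    return word.replace("ll", "l{").replace("ch", "c{")

words = ["llave", "chico", "ludar", "casa", "luz", "cuna"]
print(sorted(words, key=spanish_key))
```

With this key ludar and luz come before llave, and cuna comes before chico.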

Now the big question is how to strip the accents from åàÿñü etc.
to get aaynu. (Technically speaking: decompose the Unicode string
with normalization form NFKD, then drop the combining marks.)
One possible way is to use a scripting language which can handle it.
The only language I know of which can do it out of the box is Python. For
Ruby or Perl one has to install an additional library.
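
To illustrate the idea, here is a minimal sketch in Python 3 using the
standard unicodedata module. The name strip_accents is just for
illustration, and the sketch assumes that the marks to drop are the
combining diacritics U+0300 to U+036F (768 to 879):

```python
import unicodedata

def strip_accents(s):
    # NFKD splits an accented letter into base letter + combining mark(s)
    decomposed = unicodedata.normalize("NFKD", s)
    # keep everything except the combining diacritics U+0300..U+036F
    return "".join(c for c in decomposed if not 0x0300 <= ord(c) <= 0x036F)

print(strip_accents("åàÿñü"))  # prints aaynu
```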

On a Mac, Python is installed by default; on Windows it is not. If
this ordering is also an issue for Windows users, they have to
install it beforehand.

Here is the code:

orderES <- function(x) {
     #decomposes all accented characters
     str <- NKFD(x)

     #all combining diacritics
     nonChars <- c(768:879)
     pattern <- paste("[", intToUtf8(as.integer(nonChars)), "]", sep="")

     #delete all combining diacritics
     str <- gsub(pattern, "", str)

     #transform ll and ch to l{ and c{ ({ comes after z in ASCII;
     #order() collates by locale, so this assumes byte order as in the C locale)
     str <- gsub("ll", "l{", gsub("ch", "c{", str))
     order(str)
}

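NKFD <- function(x) {
     #builds a small Python 2 script (print statement, unicode()) that
     #prints the NFKD form of each element of x, and pipes it to python
     system(paste("echo -en '# coding=utf-8\nimport unicodedata\n",
                  "for i,v in enumerate([\"", paste(x, collapse="\", \""),
                  "\"]):print unicodedata.normalize(\"NFKD\",",
                  "unicode(v, \"UTF-8\")).encode(\"UTF-8\")'|python -",
                  sep=""), intern=TRUE)
}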

Notes on the NFKD routine:
- it only works if R's environment is set to UTF-8!
- a Danish ø, for instance, won't be decomposed to o / (these cases
have to be handled manually)
- this routine is not very fast
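
For letters like ø, one way to handle them manually is a small
hand-written table applied before the decomposition. The sketch below
is Python 3. Both the table MANUAL and the name to_ascii_base are
assumptions for illustration, and the table would have to be extended
for each language:

```python
import unicodedata

# Hypothetical hand-written table: letters that NFKD leaves untouched
MANUAL = str.maketrans({"ø": "o", "Ø": "O", "ł": "l", "Ł": "L"})

def to_ascii_base(s):
    # first replace the letters that have no Unicode decomposition ...
    s = s.translate(MANUAL)
    # ... then decompose and drop the combining diacritics U+0300..U+036F
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not 0x0300 <= ord(c) <= 0x036F)

print(to_ascii_base("Søren"))  # prints Soren
```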

Cheers,

--Hans