[R] difference in sort order linux/Windows (R.2.11.0)

Steven Lembark lembark at wrkhors.com
Sun May 30 17:20:26 CEST 2010


On Fri, 28 May 2010 01:17:49 -0700 (PDT)
carslaw <david.carslaw at kcl.ac.uk> wrote:

>  [4] "HGV-D-Euro-III"     "HGV-D-Euro-IV EGR"  "HGV-D-Euro-IV SCR" 
>  [4] "HGV-D-Euro-III"     "HGV-D-Euro-IV EGR"  "HGV-D-Euro-IV SCR"

>  [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR"   "HGV-D-Euro-VI"     
>  [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR"   "HGV-D-Euro-V SCR" 

This is a lexical sort, and the collation order depends
on the locale, so the items may not come out in ASCII
(byte) order. For example, a European Latin locale may
place some characters differently than ASCII does. You
have to check exactly what is being sorted (e.g., look
at the strings as raw UTF-8 bytes).
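
To make that concrete, here is a rough sketch in Python (not R,
which I don't know). It sorts four of the labels from the quoted
output twice: once by raw bytes, and once with spaces dropped
from the sort key. The second key is only my stand-in for a
locale that gives spaces little weight. It is an assumption, not
what any real locale or R itself does.

```python
# Labels taken from the quoted output above.
labels = [
    "HGV-D-Euro-VI",
    "HGV-D-Euro-V SCR",
    "HGV-D-Euro-IV SCRb",
    "HGV-D-Euro-V EGR",
]

# Byte order: compare the UTF-8 encodings, as a "C" locale would.
byte_order = sorted(labels, key=lambda s: s.encode("utf-8"))

# Hypothetical stand-in for a locale that gives spaces little or no
# weight: drop them before comparing. Real collation rules are more
# involved; this only shows the effect.
space_blind = sorted(labels, key=lambda s: s.replace(" ", ""))

for name, result in (("byte order", byte_order), ("space-blind", space_blind)):
    print(name)
    for label in result:
        print("   ", label)
```

In byte order "HGV-D-Euro-VI" lands after both "HGV-D-Euro-V ..."
entries, because the space (\x20) sorts before "I". With spaces
ignored it lands between them, which looks like the kind of
difference in the quoted output.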

You might also find that input generated on Windows
has "smart" (non-breaking) spaces in it from the
generating program (e.g., Excel): something like \xA0
instead of the \x20 (decimal 32) used for an ASCII
space.
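
A quick way to see the problem, again sketched in Python rather
than R. The helper name non_ascii_chars and the sample strings
are made up for illustration. Two labels that look identical on
screen compare as different, and the \xA0 version sorts after
the \x20 one.

```python
def non_ascii_chars(text):
    """Return (index, code point) pairs for characters outside ASCII."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ord(ch) > 127]

plain = "HGV-D-Euro-IV SCR"        # ordinary space, \x20
sneaky = "HGV-D-Euro-IV\u00a0SCR"  # non-breaking space, \xA0

print(plain == sneaky)             # the two look alike but are not equal
print(non_ascii_chars(plain))      # nothing outside ASCII
print(non_ascii_chars(sneaky))     # reports where the \xA0 sits

# By code point, \x20 sorts before \xA0, so "A C" comes first even
# though "B" would normally sort before "C".
mixed = sorted(["A\u00a0B", "A C"])
print([non_ascii_chars(s) for s in mixed])
```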

Suggestion: Validate the data with something like
"od -cx" on Linux so you know what you are sorting.
Then dump it out as hex in R [sorry, I have no idea
how to do that] and see if what you are sorting
matches. After that, check the locale settings
(LC_COLLATE in particular) on both sides. If the raw
data and the locale are the same on both sides, then
you have either found a bug in R or need to read some
fine print in the lexical sort docs.
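
The same checks can be roughed out in Python if od is not handy.
This is only a sketch: hex_dump and first_difference are names I
made up, and the two byte strings are invented samples standing
in for the data as read on each machine. The locale call only
reports the collation setting of the Python process, not of R.

```python
import locale

def hex_dump(text, encoding="utf-8"):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

def first_difference(a, b):
    """Return the offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # Same up to the shorter length: equal, or one is a prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))

# Invented samples: the "same" text as it might arrive from each side.
linux_copy = "IV SCR".encode("latin-1")
windows_copy = "IV\u00a0SCR".encode("latin-1")

print(hex_dump("IV SCR"))
print(first_difference(linux_copy, windows_copy))

# Passing None queries the current collation locale without changing it.
print(locale.setlocale(locale.LC_COLLATE, None))
```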

-- 
Steven Lembark                                          85-09 90th St.
Workhorse Computing                               Woodhaven, NY, 11421
lembark at wrkhors.com                                    +1 888 359 3508


