[R] difference in sort order linux/Windows (R.2.11.0)

(Ted Harding) Ted.Harding at manchester.ac.uk
Fri May 28 11:55:36 CEST 2010


On 28-May-10 08:17:49, carslaw wrote:
> Dear R users,
> 
> I'm a bit perplexed with the effect sort has here, as it is different
> on Windows vs. linux. 
> It makes my factor levels and subsequent plots different on the two
> systems.
> 
> Given:
> 
> types <- c("PC-D-Euro-0", "PC-D-Euro-1", "PC-D-Euro-2", "PC-D-Euro-3", 
> "PC-D-Euro-4", "PC-D-Euro-5", "PC-D-Euro-6", "LCV-D-Euro-0", 
> "LCV-D-Euro-1", "LCV-D-Euro-2", "LCV-D-Euro-3", "LCV-D-Euro-4", 
> "LCV-D-Euro-5", "LCV-D-Euro-6", "HGV-D-Euro-0", "HGV-D-Euro-I", 
> "HGV-D-Euro-II", "HGV-D-Euro-III", "HGV-D-Euro-IV EGR",
> "HGV-D-Euro-IV SCR",
> "HGV-D-Euro-IV SCRb", "HGV-D-Euro-V EGR", "HGV-D-Euro-V SCR", 
> "HGV-D-Euro-V SCRb", "HGV-D-Euro-VI", "HGV-D-Euro-VIb")
> 
> On linux, sort does:
> 
> sort(types)
>  [1] "HGV-D-Euro-0"       "HGV-D-Euro-I"       "HGV-D-Euro-II"     
>  [4] "HGV-D-Euro-III"     "HGV-D-Euro-IV EGR"  "HGV-D-Euro-IV SCR" 
>  [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR"   "HGV-D-Euro-VI"     
> [10] "HGV-D-Euro-VIb"     "HGV-D-Euro-V SCR"   "HGV-D-Euro-V SCRb" 
> [13] "LCV-D-Euro-0"       "LCV-D-Euro-1"       "LCV-D-Euro-2"      
> [16] "LCV-D-Euro-3"       "LCV-D-Euro-4"       "LCV-D-Euro-5"      
> [19] "LCV-D-Euro-6"       "PC-D-Euro-0"        "PC-D-Euro-1"       
> [22] "PC-D-Euro-2"        "PC-D-Euro-3"        "PC-D-Euro-4"       
> [25] "PC-D-Euro-5"        "PC-D-Euro-6"
> 
> 
> And on Windows:
> 
> sort(types)
> 
>  [1] "HGV-D-Euro-0"       "HGV-D-Euro-I"       "HGV-D-Euro-II"    
>  [4] "HGV-D-Euro-III"     "HGV-D-Euro-IV EGR"  "HGV-D-Euro-IV SCR"
>  [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR"   "HGV-D-Euro-V SCR" 
> [10] "HGV-D-Euro-V SCRb"  "HGV-D-Euro-VI"      "HGV-D-Euro-VIb"   
> [13] "LCV-D-Euro-0"       "LCV-D-Euro-1"       "LCV-D-Euro-2"     
> [16] "LCV-D-Euro-3"       "LCV-D-Euro-4"       "LCV-D-Euro-5"     
> [19] "LCV-D-Euro-6"       "PC-D-Euro-0"        "PC-D-Euro-1"      
> [22] "PC-D-Euro-2"        "PC-D-Euro-3"        "PC-D-Euro-4"      
> [25] "PC-D-Euro-5"        "PC-D-Euro-6"      
> 
> Session info for both systems is below.  The order I actually want is
> the Windows one, but looking at it, the linux order is perhaps more
> intuitive.  However, the problem is the order is inconsistent between
> the two systems.  Any suggestions?
> 
> sessionInfo()
> R version 2.11.0 (2010-04-22) 
> x86_64-pc-linux-gnu 
> 
> locale:
>  [1] LC_CTYPE=en_GB.utf8          LC_NUMERIC=C                
>  [3] LC_TIME=en_GB.utf8           LC_COLLATE=en_GB.utf8       
>  [5] LC_MONETARY=en_GB.utf8       LC_MESSAGES=en_GB.utf8      
>  [7] LC_PAPER=en_GB.utf8          LC_NAME=en_GB.utf8          
>  [9] LC_ADDRESS=en_GB.utf8        LC_TELEPHONE=en_GB.utf8     
> [11] LC_MEASUREMENT=en_GB.utf8    LC_IDENTIFICATION=en_GB.utf8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> 
> other attached packages:
> [1] rkward_0.5.3
> 
> loaded via a namespace (and not attached):
> [1] tools_2.11.0
> 
>> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-pc-mingw32
> 
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252
> [2] LC_CTYPE=English_United Kingdom.1252  
> [3] LC_MONETARY=English_United Kingdom.1252
> [4] LC_NUMERIC=C                          
> [5] LC_TIME=English_United Kingdom.1252   
> 
>  
> attached base packages:
> 
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> Dr David Carslaw

I suspect the result (in Linux, I can't test this on Windows)
may be related to the following phenomenon:

  sort(c("AB CD","ABCD"))
  # [1] "ABCD"  "AB CD"
  sort(c("AB CD","ABCD "))
  # [1] "AB CD" "ABCD "

I.e. "ABCD" precedes "AB CD" apparently because it is shorter,
despite the fact that it would come later in an alphabetical sort.
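
One way to probe this is to repeat both sorts with LC_COLLATE
temporarily set to "C", where strings are compared by character code
rather than by the locale's collation rules. A minimal sketch, assuming
the session is allowed to change its locale; sort_c is a hypothetical
helper name, not something from the original post:

```r
# Hypothetical helper: sort under the "C" collation, then restore
# whatever LC_COLLATE setting was in force before the call.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

# In "C" collation SPACE (code 32) sorts before "C" (code 67),
# so "AB CD" comes first in both cases.
print(sort_c(c("AB CD", "ABCD")))
print(sort_c(c("AB CD", "ABCD ")))
```

If the order changes under "C", the collation rules of the locale,
rather than the length of the string, would be the more likely cause.
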
If I use the Linux 'sort' command (on the same machine) I get:

sort << EOT
"AB CD"
"ABCD"
EOT
"AB CD"
"ABCD"

sort << EOT
"AB CD"
"ABCD "
EOT
"AB CD"
"ABCD "

I.e. the same result for either case. In my view the R result is
anomalous! In ?Comparison it is stated that characters are translated
to UTF-8 before comparison is done; so a possible explanation could
be that the UTF-8 encoding for SPACE (for all I know) may be greater
than that for the letters of the alphabet (as opposed to ASCII, where
-- I do know -- it is less). And, if that is the case, why doesn't it
apply also in Windows? This strikes me as a nasty little trap!
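
The conjecture about the encoding of SPACE can be checked directly with
utf8ToInt(), and the original factor-level problem can be sidestepped by
passing explicit levels to factor() instead of relying on its default of
sort(unique(x)). A minimal sketch, assuming a four-element subset of the
original 'types' vector (the name 'hgv' and the shuffled order are
illustrative only) and assuming the session may change LC_COLLATE:

```r
# Code points of SPACE and "A"; UTF-8 agrees with ASCII for these.
space_cp <- utf8ToInt(" ")
a_cp     <- utf8ToInt("A")
print(space_cp < a_cp)

# Illustrative subset of 'types', deliberately out of order.
hgv <- c("HGV-D-Euro-VI", "HGV-D-Euro-V SCR",
         "HGV-D-Euro-VIb", "HGV-D-Euro-V EGR")

# Compute the level order under "C" collation, then restore the locale.
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
lev <- sort(hgv)
invisible(Sys.setlocale("LC_COLLATE", old))

# Explicit levels do not depend on the collation in use at the
# time factor() is called.
f <- factor(hgv, levels = lev)
print(levels(f))
```

With the levels fixed this way, the "V EGR" and "V SCR" entries should
come before "VI" on both systems, which is the order reported for
Windows.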

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10                                       Time: 10:55:33
------------------------------ XFMail ------------------------------


