[Rd] sorting bug in R-devel?

Bob Rudis bob @end|ng |rom rud@|@
Tue Jan 19 14:11:35 CET 2021


base::icuSetCollate() might be what you need. Its manual page
(?icuSetCollate) has some decent examples.
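
A minimal sketch of how that might look, assuming an R build with ICU
support (check capabilities("ICU")). The "en_US" locale and the sample
vector are only illustrative choices, not something from your package.
Note that, as far as I know, R never uses ICU while the collation locale
is "C", so this replaces Sys.setlocale(locale = "C") rather than
combining with it.

```r
# Sketch only. Assumes an R build with ICU support; R does not use ICU
# while the collation locale (LC_COLLATE) is "C" or "POSIX".
x <- c("Aarhus", "aarhus", "safe", "test", "Zoo")

has_icu <- capabilities("ICU")
if (has_icu) {
  # Pin the collation rules instead of inheriting them from the OS locale.
  # "en_US" is just an illustrative choice.
  icuSetCollate(locale = "en_US")
  print(icuGetCollate("valid"))  # the ICU locale actually in effect
}

sorted <- sort(x)
print(sorted)

# Restore the default collator so later code is unaffected.
if (has_icu) icuSetCollate(locale = "default")
```

Whether this gives identical results on every OS and R version still
depends on the ICU version each build links against, so it is worth
checking against your unit tests.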

On Tue, Jan 19, 2021 at 7:30 AM Thierry Onkelinx via R-devel
<r-devel using r-project.org> wrote:
>
> Dear Peter,
>
> Thanks for the feedback on the locale. Is there a better alternative to
> the C locale, one that yields a consistent and stable sort order
> independent of the R version and OS?
>
> Best regards,
>
> Thierry
>
> ir. Thierry Onkelinx
> Statisticus / Statistician
>
> Vlaamse Overheid / Government of Flanders
> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
> FOREST
> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
> thierry.onkelinx using inbo.be
> Havenlaan 88 bus 73, 1000 Brussel
> www.inbo.be
>
> ///////////////////////////////////////////////////////////////////////////////////////////
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
> ///////////////////////////////////////////////////////////////////////////////////////////
>
> <https://www.inbo.be>
>
>
> On Tue 19 Jan 2021 at 13:20, Peter Dalgaard <pdalgd using gmail.com> wrote:
>
> > Not sure what happened between 4.0.2 and -devel, but you are using C
> > collation, which assumes 7-bit single-byte characters, to sort multi-byte
> > 8-bit encoded characters, which looks a bit risky.
> >
> > -pd
> >
> > > On 19 Jan 2021, at 10:10, Thierry Onkelinx via R-devel <r-devel using r-project.org> wrote:
> > >
> > > Dear all,
> > >
> > > > My git2rdata package relies on a stable sort order. I've noticed that
> > > > some characters get a different position under R-devel on Windows
> > > > 10. This is why the unit tests of my package fail only in this
> > > > combination
> > > > (https://cran.r-project.org/web/checks/check_results_git2rdata.html).
> > >
> > > Below is a minimal example to illustrate the problem.
> > >
> > > Best regards,
> > >
> > > Thierry
> > >
> > > > data <- readLines(
> > > >   "https://raw.githubusercontent.com/ropensci/git2rdata/master/tests/testthat/test_b_special.R",
> > > >   encoding = "UTF-8", n = 15)
> > > eval(parse(text = paste(tail(data, -3), collapse = "")))
> > > ds$a <- enc2utf8(ds$a)
> > > print(ds$a) # input
> > > Sys.setlocale(locale = "C")
> > > print(sort(ds$a)) # sorted
> > > print(order(ds$a)) # order
> > > print(sessionInfo())
> > >
> > > # input
> > > ## Win 10 R 4.0.2
> > > [1] "a"       "a b"     "a\tb"     "a\tb\tc"   "\ta"      "a\t"
> > "a\nb"
> > > [8] "a\nb\nc" "\na"     "a\n"     "a\"b"    "a\"b\"c" "\"b"     "a\""
> > > [15] "\"b\""   "a'b"     "a'b'c"   "'b"      "a'"      "'b'"     "a b c"
> > > [22] "\"NA\""  "'NA'"    NA        "é"       "&"       "à"       "µ"
> > > [29] "ç"       "\200"       "|"       "#"       "@"       "$"
> > > ## Win 10 R devel
> > > [1] "a"       "a b"     "a\tb"     "a\tb\tc"   "\ta"      "a\t"
> > "a\nb"
> > > [8] "a\nb\nc" "\na"     "a\n"     "a\"b"    "a\"b\"c" "\"b"     "a\""
> > > [15] "\"b\""   "a'b"     "a'b'c"   "'b"      "a'"      "'b'"     "a b c"
> > > [22] "\"NA\""  "'NA'"    NA        "é"       "&"       "à"       "µ"
> > > [29] "ç"       "\200"       "|"       "#"       "@"       "$"
> > > ## Ubuntu 18.04 R 4.0.3
> > > [1] "a"       "a b"     "a\tb"    "a\tb\tc" "\ta"     "a\t"     "a\nb"
> > > [8] "a\nb\nc" "\na"     "a\n"     "a\"b"    "a\"b\"c" "\"b"     "a\""
> > > [15] "\"b\""   "a'b"     "a'b'c"   "'b"      "a'"      "'b'"     "a b c"
> > > [22] "\"NA\""  "'NA'"    NA        "é"       "&"       "à"       "µ"
> > > [29] "ç"       "€"       "|"       "#"       "@"       "$"
> > >
> > > # sorted
> > > ## Win 10 R 4.0.2
> > > [1] "\ta"     "\na"     "\"NA\""  "\"b"     "\"b\""   "#"       "$"
> > > [8] "&"       "'NA'"    "'b"      "'b'"     "<U+00B5>" "<U+00E0>"
> > "<U+00E7>"
> > > [15] "<U+00E9>" "<U+20AC>" "@"       "a"       "a\t"     "a\tb"
> > "a\tb\tc"
> > > [22] "a\n"     "a\nb"    "a\nb\nc" "a b"     "a b c"   "a\""     "a\"b"
> > > [29] "a\"b\"c" "a'"      "a'b"     "a'b'c"   "|"
> > > ## Win 10 R devel
> > > [1] "\ta"     "\na"     "\"NA\""  "\"b"     "\"b\""   "#"       "$"
> > > [8] "&"       "'NA'"    "'b"      "'b'"     "@"       "a"       "a\t"
> > > [15] "a\tb"    "a\tb\tc" "a\n"     "a\nb"    "a\nb\nc" "a b"     "a b c"
> > > [22] "a\""     "a\"b"    "a\"b\"c" "a'"      "a'b"     "a'b'c"   "|"
> > > [29] "\200"       "\265"       "\340"       "\347"       "\351"
> > > ## Ubuntu 18.04 R 4.0.3
> > > [1] "\ta"     "\na"     "\"NA\""  "\"b"     "\"b\""   "#"       "$"
> > > [8] "&"       "'NA'"    "'b"      "'b'"     "<U+00B5>" "<U+00E0>"
> > "<U+00E7>"
> > > [15] "<U+00E9>" "<U+20AC>" "@"       "a"       "a\t"     "a\tb"
> > "a\tb\tc"
> > > [22] "a\n"     "a\nb"    "a\nb\nc" "a b"     "a b c"   "a\""     "a\"b"
> > > [29] "a\"b\"c" "a'"      "a'b"     "a'b'c"   "|"
> > >
> > > # order
> > > ## Win 10 R 4.0.2
> > > [1]  5  9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33  1  6  3  4 10
> > 7  8  2
> > > [26] 21 14 11 12 19 16 17 31 24
> > > ## Win 10 R devel
> > > [1]  5  9 22 13 15 32 34 26 23 18 20 33  1  6  3  4 10  7  8  2 21 14 11
> > 12 19
> > > [26] 16 17 31 30 28 27 29 25 24
> > > ## Ubuntu 18.04 R 4.0.3
> > > [1]  5  9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33  1  6  3  4 10
> > 7  8  2
> > > [26] 21 14 11 12 19 16 17 31 24
> > >
> > > R version 4.0.2 (2020-06-22)
> > > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > > Running under: Windows 10 x64 (build 18363)
> > >
> > > Matrix products: default
> > >
> > > locale:
> > > [1] C
> > > system code page: 1252
> > >
> > > attached base packages:
> > > [1] stats     graphics  grDevices utils     datasets  methods   base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.0.2 fortunes_1.5-4
> > >
> > > R Under development (unstable) (2021-01-13 r79826)
> > > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > > Running under: Windows 10 x64 (build 18363)
> > >
> > > Matrix products: default
> > >
> > > locale:
> > > [1] C
> > >
> > > attached base packages:
> > > [1] stats     graphics  grDevices utils     datasets  methods   base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.1.0
> > >
> > > R version 4.0.3 (2020-10-10)
> > > Platform: x86_64-pc-linux-gnu (64-bit)
> > > Running under: Ubuntu 18.04.5 LTS
> > >
> > > Matrix products: default
> > > BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
> > > LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
> > >
> > > locale:
> > > [1] LC_CTYPE=C                 LC_NUMERIC=C
> > > [3] LC_TIME=C                  LC_COLLATE=C
> > > [5] LC_MONETARY=C              LC_MESSAGES=nl_BE.UTF-8
> > > [7] LC_PAPER=nl_BE.UTF-8       LC_NAME=C
> > > [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > > [11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C
> > >
> > > attached base packages:
> > > [1] stats     graphics  grDevices utils     datasets  methods   base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.0.3 fortunes_1.5-4
> > >
> > >
> > > ir. Thierry Onkelinx
> > > Statisticus / Statistician
> > >
> > > Vlaamse Overheid / Government of Flanders
> > > INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
> > > AND FOREST
> > > Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
> > > thierry.onkelinx using inbo.be
> > > Havenlaan 88 bus 73, 1000 Brussel
> > > www.inbo.be
> > >
> > >
> > ///////////////////////////////////////////////////////////////////////////////////////////
> > > To call in the statistician after the experiment is done may be no
> > > more than asking him to perform a post-mortem examination: he may be
> > > able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> > > The plural of anecdote is not data. ~ Roger Brinner
> > > The combination of some data and an aching desire for an answer does
> > > not ensure that a reasonable answer can be extracted from a given body
> > > of data. ~ John Tukey
> > >
> > > ______________________________________________
> > > R-devel using r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
> >
