kry|ov@r00t @end|ng |rom gm@||@com
Mon Jan 31 12:32:01 CET 2022
On Mon, 31 Jan 2022 09:56:27 +0000
"Blätte, Andreas" <andreas.blaette using uni-due.de> wrote:
> After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
> R`), the output of `localeToCharset()` is:
>  "UTF-8" "ISO8859-1"
> why ISO8859-1 might be a fallback option here?
ISO8859-1 seems to be offered because it covers the alphabet of
American English. Obviously, this doesn't guarantee that the guess is
correct. For example, I could symlink the ru_RU.KOI8-R locale on my
system to name it "ru_RU", and localeToCharset() would return
"ISO8859-5", not knowing the correct answer. їЯавЯг, anyone?
> Part of my analysis of the code of `localeToCharset()` is that it
> targets special scenarios on Windows and macOS, but not on Linux.
Well, it almost does the right thing. GNU/Linux locales are typically
named like <language>_<country>.<encoding>, and localeToCharset()
respects the <encoding> part, but for UTF-8 it does so only if the
language and the country are specified. A quick fix for that would be
to add one final case:
--- src/library/utils/R/iconv.R (revision 81596)
+++ src/library/utils/R/iconv.R (working copy)
@@ -135,6 +135,7 @@
if(enc == "utf8") return(c("UTF-8", guess(ll)))
+ if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
(Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
&& enc != "utf8") branch.)
Maybe a better fix would be to restructure the code a bit, to always
take the encoding hint and then also try to guess if the locale looks
like it provides a language code.
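
A rough sketch of what such a restructuring might look like follows. This is not the code in src/library/utils/R/iconv.R: charset_candidates() and guess_by_language() are hypothetical names, the language table is a deliberately tiny stand-in for the real one, and the Windows and macOS special cases are left out entirely.

```r
# Hypothetical stand-in for the real language-to-charset table.
guess_by_language <- function(ll) {
    switch(ll,
           en = , de = , fr = "ISO8859-1",
           ru = "ISO8859-5",
           NA_character_)
}

# Hypothetical restructured lookup: encoding hint first, language guess second.
charset_candidates <- function(locale) {
    if (locale %in% c("C", "POSIX")) return("ASCII")

    # Split "<language>_<country>.<encoding>[@modifier]" at the dot.
    parts <- strsplit(locale, ".", fixed = TRUE)[[1L]]
    name <- parts[1L]
    enc <- if (length(parts) >= 2L) sub("@.*$", "", parts[2L]) else ""

    candidates <- character()

    # 1. Always take the encoding hint when the locale name carries one.
    if (nzchar(enc)) {
        hint <- if (tolower(enc) %in% c("utf8", "utf-8")) "UTF-8" else toupper(enc)
        candidates <- c(candidates, hint)
    }

    # 2. Also guess from the language code if the name looks like ll_CC.
    if (grepl("^[[:alpha:]]{2}_", name)) {
        candidates <- c(candidates, guess_by_language(substr(name, 1L, 2L)))
    }

    # Drop failed guesses and duplicates; NA means "no idea".
    candidates <- unique(candidates[!is.na(candidates)])
    if (length(candidates)) candidates else NA_character_
}

print(charset_candidates("en_US.UTF-8"))
print(charset_candidates("C.UTF-8"))
print(charset_candidates("ru_RU"))
```

With this ordering the explicit hint always comes first, and the language-based guess is kept only as a secondary candidate, so a name without a recognisable language code still yields its encoding.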