[Rd] localeToCharset()

Mon Jan 31 10:56:27 CET 2022

Dear all,

Packages for processing text may need information on the charset of the R session. In my packages RcppCWB and polmineR, I extract this information from the locale using `localeToCharset()`. But when running cross-platform checks (GitHub Actions and Docker), I repeatedly encounter unexpected behavior of `localeToCharset()`.

As a reproducible example, I suggest using a local Fedora (latest) container, started as follows:

docker pull fedora:latest
docker run -it fedora:latest /bin/bash

After installing R (`yum install -y R`) and starting R, `localeToCharset()` returns `NA`. However, the part of sessionInfo() on the locale is as follows:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

If I run R CMD check on an arbitrary package in this environment at this stage, I see:
* using session charset: UTF-8

However, the documentation says: “In the C locale the answer will be "ASCII".”  Why not UTF-8 in this case?

The `localeToCharset()` function also confuses me when I explicitly redefine the locale. In my fresh Fedora Docker container, I need to install English-language locales first:
dnf install langpacks-en

After starting R with a redefined locale (`env LC_CTYPE=en_US.UTF-8 R`), the output of `localeToCharset()` is:
[1] "UTF-8"     "ISO8859-1"
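
The cases above can also be compared without restarting R, because `localeToCharset()` accepts a locale string as its argument. The following is only a minimal sketch: the vector of locale names is chosen for illustration, and the results will depend on the platform and the R version, so I do not state expected output here.

```r
# Locale names chosen for illustration only; results depend on platform
# and R version.
locales <- c("C", "C.UTF-8", "en_US.UTF-8")

# localeToCharset() takes a locale string, so there is no need to restart R
# with a different environment for each case.
charsets <- lapply(locales, utils::localeToCharset)
names(charsets) <- locales

# Show the charset(s) reported for each locale string.
str(charsets)
```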

The “Value” section of the documentation says: “A character vector naming an encoding and possibly a fallback single-encoding, NA if unknown.”  But I do not understand why ISO8859-1 might be a fallback option here?

I do not know whether this is just a matter of documentation? My intuition is that `localeToCharset()` should work differently. At the moment, I need to rely on a few workarounds to cope with the behavior I do not understand.  (Or is there a better function to detect the encoding of the R session?)
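
To make the question concrete, here is a sketch of the kind of workaround I mean. It is not a definitive solution: `session_charset()` is a hypothetical helper name, and the fallback on `l10n_info()` rests on the assumption that its `UTF-8` and `Latin-1` flags describe the session correctly.

```r
# Hypothetical helper, not part of base R or of any package mentioned here.
session_charset <- function() {
  # First try the documented route and keep only the primary encoding.
  cs <- utils::localeToCharset()[1]
  if (!is.na(cs)) return(cs)

  # Fall back on what R itself reports about the current session.
  info <- l10n_info()
  if (isTRUE(info[["UTF-8"]])) return("UTF-8")
  if (isTRUE(info[["Latin-1"]])) return("ISO8859-1")

  # The codeset element is only available on some platforms and R versions.
  codeset <- info[["codeset"]]
  if (!is.null(codeset) && nzchar(codeset)) return(codeset)

  NA_character_
}

session_charset()
```

The helper tries `localeToCharset()` first, so it only changes the result in cases where that function returns `NA`.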

From my reading of the code of `localeToCharset()`, it handles special scenarios on Windows and macOS, but not on Linux.

Kind regards
Andreas

--
Prof. Dr. Andreas Blaette
Professor of Public Policy and Regional Politics
University of Duisburg-Essen
