[Rd] R on Windows with UCRT and the system encoding

Tue Dec 21 15:47:51 CET 2021

Hi Tomas,

Thank you very much for the detailed explanation! I think now I have a
bit better understanding on how the things work; at least now I know I
didn't understand the concept of "active code page". I'll follow your
advice when I need to fix the packages that need some tweaks to handle
UTF-8 properly.

Sorry, I'd like to ask one more question related to locale. If I copy
the following text and execute `read.csv("clipboard")`, it returns
"uao" instead of "úáö" (the characters are transliterated).

    "col1","col2"
    "úáö","úáö"

While this is probably the status quo (the same behavior on R 4.1) on
Latin-1 encoding, things are worse on CJK locales. If I try,

    "col1","col2"
    "あ","い"

I get the following error:

    > read.csv("clipboard")
    Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec,  :
      invalid multibyte string at '<82><a0>'

Is this supposed to work? It seems the characters are encoded as CP932
(my system locale) but marked as UTF-8.

    > x <- utils:::readClipboard()
    > x
    [1] "\"col1\",\"col2\""         "\"\x82\xa0\",\"\x82\xa2\""
    > iconv(x, from = "CP932", to = "UTF-8")
    [1] "\"col1\",\"col2\"" "\"あ\",\"い\""

I read the source code of readClipboard() in
src/library/utils/src/windows/util.c, but have no idea if there's
anything that needs to be fixed.

Best,
Yutani

2021年12月21日(火) 17:26 Tomas Kalibera <tomas.kalibera using gmail.com>:

>
> Hi Yutani,
>
> On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
> > Hi,
> >
> > I'm more than excited about the announcement about the upcoming UTF-8
> > R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
> > work on Windows with non-UTF-8 encoding as the system locale? I think
> > this blog post indicates so (as this describes the older Windows than
> > the UTF-8 era), but I'm not fully confident if I understand the
> > details correctly.
>
> R 4.2 will automatically use UTF-8 as the active code page (system
> locale) and the C library encoding and the R current native encoding on
> systems which allow this (recent Windows 10 and newer, Windows Server
> 2022, etc). There is no way to opt-out from that, and of course no
> reason to, either. It does not matter of what is the system locale set
> in Windows for the whole system - these recent Windows allow individual
> applications to override the system-wide setting to UTF-8, which is what
> R does. Typically the system-wide setting will not be UTF-8, because
> many applications will not work with that.
>
> On older systems, R 4.2 will run in some other system locale and the
> same C library encoding and R current native encoding - the same system
> default as R 4.1 would run on that system. So for some time, encoding
> support for this in R will have to stay, but eventually will be removed.
> But yes, R 4.2 is still supposed to work on such systems.
>
> > https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
> >
> > If so, I'm curious what the package authors should do when the locales
> > are different between OS and R. For example (disclaimer: I don't
> > intend to blame processx at all. Just for an example), the CRAN check
> > on the processx package currently fails with this warning on R-devel
> > Windows.
> >
> >>      1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
> > https://cran.r-project.org/web/checks/check_results_processx.html
> >
> > As far as I know, processx launches an external process and captures
> > its output, and I suspect the problem is that the output of the
> > process is encoded in non-UTF-8 while R assumes it's UTF-8. I
> > experienced similar problems with other packages as well, which
> > disappear if I switch the locale to the same one as the OS by
> > Sys.setlocale(). So, I think it would be great if there's some
> > guidance for the package authors on how to handle these properly.
>
> Incidentally I've debugged this case and sent a detailed analysis to the
> maintainer, so he knows about the problem.
>
> In short, you cannot assume in Windows that different applications use
> the same system encoding. That is not true at least with the invention
> of the fusion manifests which allow an application to switch to UTF-8 as
> system encoding, which R does. So, when using an external application on
> Windows, you need to know and respect a specific encoding used by that
> application on input and output.
>
> As an example based on processx, you have an application which prints
> its argument to standard output. If you do it this way:
>
> $ cat pr.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
> int main(int argc, char **argv) {
>
>          printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
>          int i;
>          for(i = 0; i < argc; i++) {
>                  printf("Argument %d\n", i);
>                  printf("%s\n", argv[i]);
>                  for(int j = 0; j < strlen(argv[i]); j++) {
>                          printf("byte[%d] is %x (%d)\n", i, (unsigned
> char)argv[i][j], (unsigned char)
>                  }
>          }
>          return 0;
> }
>
> the argument and hence output will be in the current native encoding of
> pr.c, because that's the encoding in which the argument will be received
> from Windows, so by default the system locale encoding, so by default
> not UTF-8 (on my system in Latin-1, as well as on CRAN check systems).
> One should also only use such programs with characters representable in
> Latin-1 on such systems. When you call such application from R with
> UTF-8 as native encoding, Windows will automatically convert the
> arguments to Latin-1.
>
> The old Windows way to avoid this problem is to use the wide-character
> API (now UTF-16LE):
>
> $ cat prw.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
>
> int wmain(int argc, wchar_t **argv) {
>
>          int i;
>          for(i = 0; i < argc; i++) {
>                  wprintf(L"Argument %d\n", i);
>                  wprintf(argv[i]);
>                  wprintf(L"\n");
>                  for(int j = 0; j < wcslen(argv[i]); j++)
>                          wprintf(L"Word[%d] %x\n", j,
> (unsigned)argv[i][j]);
>          }
>          return 0;
> }
>
> When you call such program from R with UTF-8 as native encoding, Windows
> will convert the arguments to UTF-16LE (so all characters will be
> representable). But you need to write Windows-specific code for this.
>
> The new Windows way to avoid this problem is to use UTF-8 as the native
> encoding via the fusion manifest, as R does. You can use the "pr.c" as
> above, but with something like
>
> $ cat pr.rc
> #include <windows.h>
> CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
>
> $ cat pr.manifest
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
> <assemblyIdentity
>      version="1.0.0.0"
>      processorArchitecture="amd64"
>      name="pr.exe"
>      type="win32"
> />
> <application>
>    <windowsSettings>
>      <activeCodePage
> xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
>    </windowsSettings>
> </application>
> </assembly>
>
> windres.exe -i pr.rc -o pr_rc.o
> gcc -o pr pr.c pr_rc.o
>
> When you build the application this way, it will use UTF-8 as native
> encoding, so when you call it from R (with UTF-8) as native encoding, no
> input conversion will occur. However, when you do this, the output from
> the application will also be in UTF-8.
>
> So, for applications you control, my recommendation would be to make
> them use Unicode one of these two ways. Preferably the new one, with the
> fusion manifest. Only if it were a Windows-only application, and had to
> work on older Windows, then the wide-character version (but such apps
> are probably not in R packages).
>
> When working with external applications you don't control, it is harder
> - you need to know which encoding they are expecting and producing, in
> whatever interface you use, and convert that, e.g. using iconv(). By the
> interface I mean that e.g., the command-line arguments are converted by
> Windows, but the input/output sent over a file/stream will not be.
>
> Of course, this works the other way around as well. If you were using R
> with some other external applications expecting a different encoding,
> you would need to handle that (by conversions). With applications you
> control, it would make sense using this opportunity to switch to UTF-8.
> But, in principle, you can use iconv() from R directly or indirectly to
> convert input/output streams to/from a known encoding.
>
> I am happy to give more suggestions if there is interest, but for that
> it would be useful to have a specific example (with processx, it is
> clear what the options R, there the application is controlled by the
> package).
>
> Best
> Tomas
> >
> > Any suggestions?
> >
> > Best,
> > Yutani
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel