[R] UTF-8 to the console

Thu Oct 13 18:03:30 CEST 2022

On 6/23/22 13:36, Ivan Krylov wrote:
> On Thu, 23 Jun 2022 12:26:23 +0200
> Helmut Schütz <helmut.schuetz using bebac.at> wrote:
>
>> txt <- "x ≥ y, x \u2265 y; a ≈ b, a \u2248 b"
>> Encoding(txt) <- "UTF-8"
> There shouldn't be a need to change the encoding. If you're creating a
> Unicode literal, R should already choose UTF-8 for the resulting
> string. Either way, R automatically converts the strings from their
> source encoding on output.
>
> Moreover, `Encoding<-` doesn't perform any conversion, it only changes
> the declared encoding on the string, affecting the way it may be
> encoded or decoded in the future. If Encoding(txt) wasn't already UTF-8,
> you would likely be damaging the data:
>
> string <- 'Ы' # is already UTF-8
> # No conversion happens, the same bytes re-interpreted differently
> Encoding(string) <- 'latin1'
> string
> # [1] "Ð«"
>
>> R 4.2.0 on Windows 7
> On Windows 7, Rterm will stay limited to the OEM encoding, since UCRT
> only supports UTF-8 locales on Windows ≥ 10, version 1903. If your OEM
> encoding doesn't have the ≥, ≈ characters, printing them to the console
> is going to be hard. Not impossible -- e.g. an R extension written in C
> could obtain a handle to the current console and use Unicode-aware
> Windows API to print these characters -- but just getting it to work
> would be hard, and it would likely be unportable.
>
>> and Windows 11.
> I think it should be possible. What does system('chcp') say in your
> Rterm session?
>
> For console UTF-8 output to work, two things should happen:
>
> 1. The console must be using UTF-8, i.e. chcp must say it's using code
> page 65001.
>
> 2. Rterm must understand that and also use UTF-8 on output.
>
> What do sessionInfo() and l10n_info() say in your Rterm session on
> Windows 11? In Rterm source code, I see a check for GetACP() == 65001,
> which should have switched the console encoding to UTF-8 automatically.
>
> Perhaps you need to run chcp 65001 before starting Rterm? Maybe you
> need to set a checkbox [*] to make the ANSI codepage UTF-8 by default?
> I'm not sure any of this is going to work, but it's something to try
> before someone more knowledgeable with R on Windows can help you.

You are right: R already tries to set the console code page to 65001
itself, so running chcp is no longer needed to change it.
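
A minimal sketch for checking what a given session ended up with, using
the l10n_info() and system('chcp') calls mentioned above. The variable
name info is only illustrative, and the exact set of fields returned by
l10n_info() depends on the platform and R version, so treat the output
as something to inspect rather than a fixed format:

```r
# Inspect how the current R session sees its locale and encoding.
info <- l10n_info()
str(info)  # includes the logical flags MBCS, `UTF-8` and `Latin-1`

# TRUE when R considers the session encoding to be UTF-8.
info[["UTF-8"]]

if (.Platform$OS.type == "windows") {
  # Ask Windows for the active console code page; 65001 means UTF-8.
  system("chcp")
}
```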

On Windows 7, R would not use UTF-8 because it can't be the
"system encoding" (ACP) there, and I suppose Helmut's output was from
that version of Windows:

x = y, x = y; a ˜ b, a ˜ b

This suggests that Windows transliterated ("best-fit") the characters
that are not representable in the session encoding. On Windows 7,
characters not representable in the user locale encoding cannot be
used in Rterm. There is no way around that, but one can use Rgui
instead, for example.
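
To see in advance whether a string survives conversion to a narrower
encoding, something like the following sketch may help. It uses iconv()
with ASCII as a stand-in target, which is an assumption made for
illustration: it is not the Windows best-fit mechanism, and the names
txt, strict and marked are hypothetical. The result of the strict
conversion can also depend on the platform's iconv implementation:

```r
txt <- "x \u2265 y; a \u2248 b"

# Strict conversion: typically NA when a character has no equivalent
# in the target encoding.
strict <- iconv(txt, from = "UTF-8", to = "ASCII")
is.na(strict)

# Make the loss visible instead of silent: each unconvertible byte is
# replaced by its hex code in angle brackets.
marked <- iconv(txt, from = "UTF-8", to = "ASCII", sub = "byte")
marked
```

In practice one would replace "ASCII" with the encoding actually used
by the session to find out which characters are at risk.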

On newer Windows versions, such as Windows 11, there was a bug in
Rterm, which I reported in the other email; it is fixed now. The output
on my Windows 10 was:

x  y, x ≥ y; a  b, a ≈ b

So the pasted characters were missing, but the characters expressed via
\u escapes were printed correctly. This was a problem between the
Windows console implementation used in cmd.exe/PowerShell and
Rterm/getline. Characteristic of such problems is that the behavior
differs when Rterm runs from the Windows Terminal application (or
possibly Msys2 mintty/winpty/bash).
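
When it is unclear whether the console input or the console output is
at fault, it can help to look at what the string really contains,
independently of how it is displayed. A small sketch along those lines,
with the string built from \u escapes so that the source stays
ASCII-only (the variable name txt follows the original example, the
rest is only illustrative):

```r
txt <- "x \u2265 y; a \u2248 b"

# Unicode code point of an escaped character, printed as hex.
as.hexmode(utf8ToInt("\u2265"))

# The UTF-8 bytes that represent it.
charToRaw("\u2265")

# Characters versus bytes: the two differ when non-ASCII is present.
nchar(txt, type = "chars")
nchar(txt, type = "bytes")
```

Running the same checks on a string pasted into the console shows
whether the characters were already lost on input.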

Best
Tomas
