[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Wed Apr 10 18:32:19 CEST 2019

On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>
> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> > Since it is "technically easy" to disable the best fit conversion and
> > the best fit is rarely good, how about providing an option for
> > code/package authors to disable it? I'm asking because this is one of
> > the most painful issues in packages that may need to source() code
> > containing UTF-8 characters that are not representable in the Windows
> > native encoding. Examples include knitr/rmarkdown and shiny. Basically
> > users won't be able to knit documents or run Shiny apps correctly when
> > the code contains characters that cannot be represented in the native
> > encoding.
>
> Wouldn't things be worse with it disabled than currently?  I'd expect
> the line containing the "ř" to end up as NA instead of converting to "r".

I don't think it would be worse. With best fit disabled, R would no
longer implicitly convert strings to (best fit) latin1 on Windows, but
would instead keep the (correct) string in its UTF-8 encoding. The NA
only appears if the user explicitly forces a conversion to latin1,
which I think is not the problem here.

The original problem, which I can reproduce in RGui, is that if you
enter "ř" in RGui, R opportunistically converts it to latin1, because
it can. However, if you enter text that definitely cannot be
represented in latin1, R encodes the string correctly as UTF-8.
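
To make the distinction concrete, here is a rough sketch in Python (not
R, and not the actual Windows conversion code) of the three behaviours
discussed in this thread. The function names are hypothetical, and the
"best fit" step is only approximated by stripping combining marks after
NFKD decomposition; the real Windows best-fit tables are an assumption
I am not reproducing here.

```python
import unicodedata


def strict_latin1(text):
    """Return latin-1 bytes, or None when the text is not representable.

    None plays the role of the NA you get from an explicit, strict conversion.
    """
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        return None


def approx_best_fit(text):
    """Rough stand-in for a best-fit mapping (an assumption, not the real table).

    Decompose characters (NFKD), drop the combining marks, then convert.
    This silently changes the text: "ř" comes out as plain "r".
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return strict_latin1(stripped)


def keep_or_convert(text):
    """Convert only when the conversion is lossless; otherwise keep UTF-8."""
    converted = strict_latin1(text)
    if converted is not None:
        return ("latin1", converted)
    return ("UTF-8", text.encode("utf-8"))


if __name__ == "__main__":
    for s in ["é", "ř", "中"]:
        print(repr(s), strict_latin1(s), approx_best_fit(s), keep_or_convert(s))
```

The last function is the behaviour I would expect with best fit
disabled: "ř" stays intact as UTF-8 instead of silently becoming "r",
and nothing turns into NA unless a strict conversion is requested
explicitly.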