[Rd] Encoding issues

Mon Feb 18 17:45:14 CET 2019

On 2/18/19 4:36 PM, Iñaki Ucar wrote:
> Hi,
>
> We found a (to our eyes) strange behaviour that might be a bug. First
> a little bit of context. The 'units' package allows us to set the unit
> using either SE or NSE. E.g., these both work in the same way:
>
> units::set_units(1:10, "μm")
> #> Units: [μm]
> #> [1]  1  2  3  4  5  6  7  8  9 10
>
> units::set_units(1:10, μm)
> #> Units: [μm]
> #> [1]  1  2  3  4  5  6  7  8  9 10
>
> That's micrometers, and works fine if the session charset is UTF-8.
> Now the funny part comes with Windows. The first version, with quotes,
> works fine, but the second one fails. This is easy to demonstrate from
> Linux:
>
> LC_CTYPE=en_US.iso88591 Rscript -e 'units::set_units(1:10, "μm")'
> #> Units: [μm]
> #> [1]  1  2  3  4  5  6  7  8  9 10
>
> LC_CTYPE=en_US.iso88591 Rscript -e 'units::set_units(1:10, μm)'
> #> Error: unexpected input in "units::set_units(1:10, μ"
> #> Execution halted
>
> However, if you use the first version, with quotes, in an example, and
> the package is checked on Windows, it fails too (see
> https://ci.appveyor.com/project/edzer/units/builds/22440023#L747). The
> package declares UTF-8 encoding, so none of these errors should, in
> principle, happen. Am I wrong?

Hi Iñaki,

If you want to report a bug against R, please provide a minimal
reproducible example that uses only base packages (not units). Please
also see sections 1.3 and 1.6.3 of Writing R Extensions (WRE), including:

"There is a portable way to have arbitrary text in character strings 
(only) in your R code, which is to supply them in Unicode as ‘\uxxxx’ 
escapes."
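
As a minimal sketch of that advice, the unit string from the report can be
written with an escape so that the source file itself stays pure ASCII. This
assumes the character in question is the Greek small letter mu (U+03BC); the
variable name is only illustrative.

```r
# Assumption: the "μ" in the report is U+03BC (Greek small letter mu).
# Written as an escape, the source stays ASCII regardless of the locale.
unit <- "\u03bcm"

# The parser marks the resulting string as UTF-8.
print(Encoding(unit))

# Two characters: U+03BC (decimal 956) followed by "m" (decimal 109).
print(utf8ToInt(unit))
```

This only covers string literals; it does not make the string safe for every
later operation.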

"If your package specifies an encoding in its DESCRIPTION file, you 
should run these tools in a locale which makes use of that encoding" 
(this includes R CMD check)

Even though there are portable ways to put a UTF-8 string literal into
source code when it is not representable in the current native encoding
(e.g. using \u escapes), that does not mean such a string can be used
freely in R. Many operations require conversion to the current native
encoding, and that conversion will cause an error or an unexpected
result. Such conversions can happen at any time (except where they are
documented not to happen).
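
A rough way to see the effect without switching locales is to request the
conversion explicitly. In the sketch below, iconv() to ISO-8859-1 stands in
for the implicit conversion R would perform in a Latin-1 session; the string
and variable names are illustrative, and the exact result depends on the
platform's iconv implementation.

```r
# Assumption: U+03BC (Greek small letter mu), which ISO-8859-1 cannot represent.
unit <- "\u03bcm"

# Explicit stand-in for an implicit conversion to a Latin-1 native encoding.
converted <- iconv(unit, from = "UTF-8", to = "ISO-8859-1")

# R documents NA for elements that cannot be converted; some iconv
# implementations substitute a placeholder character instead. Either way,
# the original string does not survive the conversion.
print(converted)
print(is.na(converted) || !identical(converted, unit))
```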

Implementing an API in a package that works with such strings would be
hard to get right, but not impossible. NSE will not work:
non-representable text is supported only in string constant literals,
not in symbols or other code. One can save a lot of headaches by using
only ASCII in function APIs.
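
One possible shape for such an ASCII-only API is sketched below. The function
set_unit_ascii() and its alias table are hypothetical and are not part of
units or base R; the idea is only that the caller types ASCII (quoted or
unquoted) and the non-ASCII label appears solely as an escaped string literal
inside the package.

```r
# Hypothetical helper, for illustration only (not the units package API).
set_unit_ascii <- function(x, unit) {
  # Accept both set_unit_ascii(x, um) and set_unit_ascii(x, "um").
  key <- as.character(substitute(unit))

  # ASCII names map to labels written as \u escapes (assumed code points).
  aliases <- c(um = "\u03bcm", ohm = "\u03a9")

  if (!(key %in% names(aliases))) {
    stop("unknown unit alias: ", key)
  }
  structure(x, unit = aliases[[key]])
}

a <- set_unit_ascii(1:3, um)    # NSE with an ASCII symbol
b <- set_unit_ascii(1:3, "um")  # SE with an ASCII string
print(identical(a, b))
```

Printing or otherwise converting the stored label can still hit the native
encoding issue described above, so this only removes the parsing failure.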

Best
Tomas

>
> Thanks in advance, regards,
> Iñaki
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
