[Rd] encoding issues even w/o accents (background on single quotes)

Fri Jan 19 20:39:49 CET 2007

On Wed, Jan 17, 2007 at 11:56:15PM -0800, Ross Boylan wrote:
> An earlier thread (in 10/2006) discussed encoding issues in the
> context of R data and the desire to represent accented characters.
> 
> It matters in another setting: the output generated by R and the
> seemingly ordinary character "'" (single quote).  In particular, R CMD
> check runs test code and compares the generated output to a saved file
> of expected output.  This does not work reliably across encoding
> schemes.  This is unfortunate, since it seems the "expected output"
> files will necessarily be wrong for someone.
> 
> The problem for me was triggered by the single-quote character "'".
> On my older systems, this is encoded by 0x27, a perfectly fine ASCII
> character.  That is on a Debian GNU/Linux system with LANG=en_US.  On
> a newer system I have LANG=en_US.UTF-8.  I don't recall whether
> this was a deliberate choice on my part, or simply reflects changing
> defaults for the installer.  (Note the earlier thread referred to the
> Debian-derived Ubuntu systems as having switched to UTF-8).  Under
> UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
> (which seems odd; I thought the point of UTF-8 was that ASCII was a
> legitimate subset).

Apparently quoting, particularly single quotes, is a can of worms:
http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
When Unicode is available (which would be the case with UTF-8),
particular non-ASCII characters are recommended for single quoting.
The 3-byte sequence is the UTF-8 encoding of U+2018, the recommended
left single quotation mark.

See http://en.wikipedia.org/wiki/UTF-8 on UTF-8 encoding.
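To make the byte values concrete, here is a minimal sketch in Python (my
choice for illustration only; the helper name utf8_hex is made up).  It
shows that the ASCII apostrophe is still the single byte 0x27 under
UTF-8, and that the 3-byte sequence belongs to a different character,
U+2018.

```python
# ASCII apostrophe versus the left single quotation mark U+2018.
ascii_quote = "'"
left_quote = "\u2018"

def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as an uppercase hex string (hypothetical helper)."""
    return ch.encode("utf-8").hex().upper()

print(utf8_hex(ascii_quote))  # 27
print(utf8_hex(left_quote))   # E28098

# ASCII remains a subset of UTF-8: the byte 0x27 decodes to the same apostrophe.
assert b"\x27".decode("utf-8") == ascii_quote
```

So ASCII is a legitimate subset after all; the output simply uses a
different quote character when the locale is UTF-8.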

This is more than I or, probably, you ever wanted to know about this
issue!

Ross

> 
> The coefficient printing methods in the stats package use the
> single-quote in the key explaining significance levels:
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> 
> I suppose one possible work-around for R CMD check would be to set the
> encoding to some standard value before it runs tests, but that has
> some drawbacks.  It doesn't work for packages needing a different
> encoding (but perhaps the package could specify an encoding to use by
> default?) (*).  It will leave the output files looking weird on systems
> with a different encoding.  It will get messed up if one generates the
> files under the wrong encoding.
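The general idea of pinning the locale before running a test command
can be sketched as follows.  This is only an illustration in Python, not
how R CMD check works: the helper run_in_fixed_locale is hypothetical,
and the child process is a stand-in for a real test run.  It assumes
that forcing LC_ALL and LANG to "C" is an acceptable "standard value".

```python
import os
import subprocess
import sys

def run_in_fixed_locale(cmd, locale_name="C"):
    """Run cmd with LC_ALL and LANG forced to locale_name; return stdout as bytes.

    Hypothetical helper for illustration; not part of any R tooling.
    """
    env = dict(os.environ, LC_ALL=locale_name, LANG=locale_name)
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout

# Stand-in for the real test command: a child process that reports the
# LC_ALL value it was started with.
child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
print(run_in_fixed_locale(child).decode("ascii").strip())
```

The drawbacks listed above still apply: a package that needs a
different encoding would have to override the default somehow.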
> 
> And none of this addresses stuff beyond the context of output file
> comparison in R CMD check.
> 
> Any thoughts?
> 
> Ross Boylan
> 
> 
> * From the R Extensions document, discussing the DESCRIPTION file:
>    If the `DESCRIPTION' file is not entirely in ASCII it should contain
> an `Encoding' field specifying an encoding.  This is currently used as
> the encoding of the `DESCRIPTION' file itself, and may in the future be
> taken as the encoding for other documentation in the package.  Only
> encoding names `latin1', `latin2' and `UTF-8' are known to be portable.
> 
> I would not expect that the test output files be considered
> "documentation," but I suppose that's subject to interpretation.