[Rd] Character encodings and packages

Sun Jan 27 12:49:29 CET 2008

Since R 2.5.0 it has been possible to declare the encodings of character 
strings (at the level of individual elements of a character vector).
As a reminder, here is the announcement in NEWS:

     o  R now attempts to keep track of character strings which are
        known to be in Latin-1 or UTF-8 and print or plot them
        appropriately in other locales.  This is primarily intended
        to make it possible to use data in Western European languages
        in both Latin-1 and UTF-8 locales.  Currently scan(),
        read.table(), readLines(), parse() and source() allow
        encodings to be declared, and console input in suitable
        locales is also recognized.

        New function Encoding() can read or set the declared encodings
        for a character vector.

Whereas R itself is careful to make use of this, I see very little 
recognition of it in packages -- which need to be making use of 
translateChar() rather than CHAR(): see the 'Writing R Extensions' manual. 
(I see it used in only one package, and that mainly in a copy of base R 
code.)

This will become more important as time goes by and more ways are
introduced to generate marked data.  In particular, in R 2.7.0 under
Windows, 'Unicode' data (as used by NT-based versions of Windows, usually
UCS-2 but possibly UTF-16) is translated to UTF-8 and marked as such.

In essence, every time you use CHAR() in a .Call/.External call in a
package you should consider whether the data can be non-ASCII and, if so,
how you want to handle it.  The choices are:

- to replace CHAR() by translateChar() and handle the string in the native 
encoding of the current locale.  This needs the package to depend on
'R (>= 2.5.0)'.

- to note the declared encoding and handle the string in that encoding.

- to translate the string to UTF-8 and handle it in UTF-8.  This will be 
easiest to do in R >= 2.7.0 using the function translateCharUTF8().
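
As a standalone illustration of what the third choice involves (this is
not code from R itself): inside a package the translation would be done
by translateChar() or translateCharUTF8(), but the sketch below shows, in
plain C with no R headers, roughly what checking for non-ASCII data and
translating Latin-1 to UTF-8 amounts to.  The function names is_ascii()
and latin1_to_utf8() are hypothetical, and the sketch assumes the input
really is Latin-1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if every byte of s is 7-bit ASCII,
   0 otherwise.  Pure-ASCII strings are byte-identical in Latin-1 and
   UTF-8, so they need no translation. */
int is_ascii(const char *s)
{
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *) s; *p; p++)
        if (*p > 0x7F) return 0;
    return 1;
}

/* Hypothetical helper: convert a NUL-terminated string, assumed to be
   Latin-1, to UTF-8.  Writes at most outsize bytes (including the
   terminating NUL) to out.  Returns the number of bytes written,
   excluding the NUL, or (size_t) -1 if the buffer is too small. */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *) in; *p; p++) {
        if (*p < 0x80) {
            /* ASCII: copied unchanged; keep room for the NUL */
            if (n + 1 >= outsize) return (size_t) -1;
            out[n++] = (char) *p;
        } else {
            /* 0x80..0xFF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx */
            if (n + 2 >= outsize) return (size_t) -1;
            out[n++] = (char) (0xC0 | (*p >> 6));
            out[n++] = (char) (0x80 | (*p & 0x3F));
        }
    }
    if (n >= outsize) return (size_t) -1;
    out[n] = '\0';
    return n;
}
```

For example, the 4-byte Latin-1 string "caf\xE9" becomes the 5-byte UTF-8
string "caf\xC3\xA9".  An output buffer of twice the input length plus one
byte is always large enough for this conversion.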

For writers of graphics devices there is a further twist in R >= 2.7.0:
currently text is passed to the graphics device in the native encoding, 
but by setting the DevDesc variable hasTextUTF8 to TRUE you can indicate 
to the graphics engine the ability to accept text in UTF-8.  This is done 
in several of the standard devices: for example windows() was already 
re-encoding to UCS-2 for plotting, and postscript()/pdf() also re-encode 
to the selected single-byte encoding.

Character data passed to .C or .Fortran is automatically re-encoded to the 
current locale (for .C, from the encoding specified by ENCODING=, 
otherwise from the declared encoding if any).

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595