[R] using non-ASCII strings in R packages

Thu Jan 25 10:17:48 CET 2007

On Thu, 25 Jan 2007, Bojanowski, M.J.  (Michal) wrote:

> Hello dear useRs and wizaRds,
>
> I am currently developing a package that will make it possible to use an 
> administrative map of Poland in R plots. Among other things I wanted to 
> include region names in proper Polish so that they can be used 
> in creating graphics etc. I am working on Windows and when I build the 
> package it complains about non-ASCII characters in the R code files.
>
> I was wondering what would be the best way to properly implement them in 
> a platform-independent way so that they can be used in computations as 
> well as in producing PS, PDF and other graphic output. Unfortunately I 
> have a limited knowledge of encoding schemes etc. Is it OK to include 
> them in Windows-1250 encoding (default for Polish locale, as far as I 
> know)? I believe this problem is frequently confronted for other 
> "non-latin1" languages.

Well, infrequently, and it has been answered a few times before (including 
in my talk at UseR 2006, 
http://www.r-project.org/useR-2006/Slides/Ripley.pdf).

> If it is not the way to go, I would be very grateful for suggestions.

Since a Japanese-language Windows machine cannot reproduce Polish 
non-ASCII characters, the portability you seek is not possible for reasons 
outside R.  And many other systems cannot plot in both Polish and their 
native language, or at least not in the same font.

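ISOLatin2 is the standard 8-bit encoding for Polish: Windows CP1250 covers a 
superset of its characters, AFAIR, although some of them (including a few 
Polish letters) sit at different code points.  If all your users are using 
an 8-bit Polish locale with that encoding, ISOLatin2 would be safe, but not 
otherwise.  Even then, there is no 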
guarantee that the Polish characters would be in the fonts used in 
PostScript and PDF: some fonts only cover ISOLatin1.

There is one thing you can do to make this a little more portable (and 
avoid the warnings).  If you store the strings concerned in a text file in 
ISOLatin2, and read them into R at run time (e.g. when your package is 
loaded), you can make use of file(encoding=) or iconv() to convert them to 
the current encoding.  That will succeed in ISOLatin2 or CP1250 or UTF-8 
locales and fail otherwise.
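
As a rough sketch of the approach just described (not tested on every 
platform): the region names and the temporary file below are hypothetical 
stand-ins for a data file shipped with the package, and the example assumes 
that the platform's iconv knows the encoding under the name "ISO-8859-2" 
(iconvlist() shows the names available).  It converts to UTF-8 so that the 
result does not depend on the session's locale; to = "" would convert to the 
current encoding instead, with the limitations noted above.

```r
# Hypothetical region names, written with \u escapes so that this
# source file itself stays pure ASCII.
regions_utf8 <- c("\u0141\u00f3dzkie", "\u015al\u0105skie", "Ma\u0142opolskie")

# Stand-in for a text file shipped with the package: write the names
# out as raw ISO-8859-2 (ISOLatin2) bytes.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(iconv(regions_utf8, from = "UTF-8", to = "ISO-8859-2"),
           con, useBytes = TRUE)
close(con)

# At load time: read the lines as they are and convert them explicitly.
# Use to = "" to target the current locale's encoding instead of UTF-8.
raw_lines <- readLines(path)
regions <- iconv(raw_lines, from = "ISO-8859-2", to = "UTF-8")
```

In a package, the reading step would typically sit in a load hook and take 
the path from system.file() rather than from tempfile().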

Unfortunately that is not the end of the story for users of UTF-8 locales, 
as postscript() and pdf() do not support UTF-8 (as the graphics languages 
do not) and need to be told to use encoding="ISOLatin2.enc", and the X11 
system has a mind of its own and may not show non-ASCII characters in some 
fonts (or worse, render them incorrectly).
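
A minimal sketch of passing the encoding to pdf(); the label and the output 
file are made up for illustration, and whether the Polish glyphs actually 
appear still depends on the font in use and on the session's locale being 
able to represent them.

```r
# Hypothetical label ("Lodz" with Polish diacritics), written with
# \u escapes so that the source stays ASCII.
label <- "\u0141\u00f3d\u017a"

# Ask the pdf() device to use the ISOLatin2 encoding file for its text.
out <- tempfile(fileext = ".pdf")
pdf(out, encoding = "ISOLatin2.enc")
plot(1, 1, main = label)
dev.off()
```

postscript() takes the same encoding argument.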

The use of Unicode was supposed to reduce the impact of Babel.  But
implementation split into two camps (Windows with UCS-2 and Unix-alikes 
with UTF-8) and some important players (e.g. Adobe) have ignored it, so it 
has only been a very partial solution.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595