[R] using non-ASCII strings in R packages
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu Jan 25 10:17:48 CET 2007
On Thu, 25 Jan 2007, Bojanowski, M.J. (Michal) wrote:
> Hello dear useRs and wizaRds,
> I am currently developing a package that will make it possible to use an
> administrative map of Poland in R plots. Among other things I wanted to
> include region names in proper Polish language so that they can be used
> in creating graphics etc. I am working on Windows and when I build the
> package it is complaining about non-ASCII characters in R code files.
> I was wondering what would be the best way to properly implement them in
> a platform-independent way so that they can be used in computations as
> well as in producing PS, PDF and other graphic output. Unfortunately I
> have a limited knowledge of encoding schemes etc. Is it OK to include
> them in Windows-1250 encoding (default for Polish locale, as far as I
> know)? I believe this problem is frequently confronted for other
> "non-latin1" languages.
Well, infrequently, and it has been answered a few times before (including
in my talk at UseR 2006).
> If it is not the way to go, I would be very grateful for suggestions.
Since a Japanese-language Windows machine cannot reproduce Polish
non-ASCII characters, the portability you seek is not possible for reasons
outside R. And many other systems cannot plot in both Polish and their
native language, or at least not in the same font.
ISOLatin2 is the standard 8-bit encoding for Polish: Windows CP1250 covers a
superset of its characters, AFAIR, although several Polish letters sit at
different byte values in the two. If all your users are using an 8-bit Polish locale,
ISOLatin2 would be safe, but not otherwise. Even then, there is no
guarantee that the Polish characters would be in the fonts used in
PostScript and PDF: some fonts only cover ISOLatin1.
There is one thing you can do to make this a little more portable (and
avoid the warnings). If you store the strings concerned in a text file in
ISOLatin2, and read them into R at run time (e.g. when your package is
loaded), you can make use of file(encoding=) or iconv() to convert them to
the current encoding. That will succeed in ISOLatin2 or CP1250 or UTF-8
locales and fail otherwise.
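
A minimal sketch of that approach follows. The helper names
(read_latin2, read_latin2_con) and the temporary file standing in for a
data file shipped with the package are assumptions made for illustration;
only file(encoding=), readLines() and iconv() come from the advice above.

```r
# Stand-in for a data file shipped with the package: two region names
# written to a temporary file as ISO-8859-2 (ISOLatin2) bytes.
# The "\u" escapes keep this R source itself pure ASCII.
region_utf8 <- c("\u0141\u00f3dzkie", "\u015al\u0105skie")
path <- tempfile(fileext = ".txt")
writeLines(iconv(region_utf8, from = "UTF-8", to = "ISO-8859-2"),
           path, useBytes = TRUE)

# Hypothetical helper: read the lines unconverted, then re-encode them
# with iconv(). to = "" means "the encoding of the current locale".
read_latin2 <- function(path, to = "") {
  bytes <- readLines(path, warn = FALSE)
  iconv(bytes, from = "ISO-8859-2", to = to)
}

# Hypothetical alternative: let the connection re-encode to the current
# locale while reading.
read_latin2_con <- function(path) {
  con <- file(path, open = "r", encoding = "ISO-8859-2")
  on.exit(close(con))
  readLines(con)
}

# to = "UTF-8" is forced here only so the demonstration does not depend
# on the locale it is run in; a package would normally use to = "".
regions <- read_latin2(path, to = "UTF-8")
```

With to = "", iconv() returns NA for any string that cannot be represented
in the current locale, which is one way to detect the "fail otherwise"
case at load time.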
Unfortunately that is not the end of the story for users of UTF-8 locales,
as postscript() and pdf() do not support UTF-8 (as the graphics languages
do not) and need to be told to use encoding="ISOLatin2.enc", and the X11
system has a mind of its own and may not show non-ASCII characters in some
fonts (or worse, render them incorrectly).
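
As a rough illustration of the encoding argument, the sketch below opens a
pdf() device with encoding="ISOLatin2.enc" and draws one Polish label. The
output file and the label are made up for the example, and whether the
glyphs actually appear still depends on the font in use, as noted earlier.

```r
# Temporary output file and an example label ("\u" escapes keep the
# source ASCII); both are assumptions for illustration only.
out <- tempfile(fileext = ".pdf")
label <- "\u0141\u00f3d\u017a"

# Ask the device to use the ISOLatin2 encoding file shipped with R.
pdf(out, encoding = "ISOLatin2.enc")
plot(1, 1, type = "n", xlab = "", ylab = "", main = label)
text(1, 1, label)
invisible(dev.off())
```

postscript() accepts the same encoding argument.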
The use of Unicode was supposed to reduce the impact of Babel. But
implementation split into two camps (Windows with UCS-2 and Unix-alikes
with UTF-8) and some important players (e.g. Adobe) have ignored it, so it
has only been a very partial solution.
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595