[Rd] use of UTF-8 \uxxxx escape sequences in function arguments

Thomas Zumbrunn thomas at zumbrunn.name
Fri Jan 20 00:39:18 CET 2012


On Thursday 19 January 2012, peter dalgaard wrote:
> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
> >   plain("Zürich")  ## works
> >   plain("Z\u00BCrich")  ## fails
> >   escaped("Zürich")  ## fails
> >   escaped("Z\u00BCrich")  ## works
> 
> Using the correct UTF-8 code helps quite a bit:
> 
> U+00BC	¼	c2 bc	VULGAR FRACTION ONE QUARTER
> U+00FC	ü	c3 bc	LATIN SMALL LETTER U WITH DIAERESIS

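Thank you for pointing that out. How embarrassing - I systematically used the 
wrong representations. Even worse, I didn't carefully read "Writing R 
Extensions", which speaks of "Unicode as \uxxxx escapes" rather than "UTF-8 as 
\uxxxx escapes". The escape takes the Unicode code point (U+00FC for "ü"), 
not the UTF-8 byte sequence (c3 bc), so looking up the code point would have 
done the trick. For characters in the Basic Multilingual Plane, the code 
point coincides with the UTF-16 code unit.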
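
The difference between the code point and the UTF-8 bytes can be checked 
directly in R. The following is only a minimal sketch using base functions 
(utf8ToInt, charToRaw, enc2utf8, sprintf); the variable names are 
illustrative:

```r
# "\u00FC" names the Unicode code point U+00FC, not a UTF-8 byte.
x <- "\u00FC"

# Code point as an integer, and in the usual U+XXXX notation
cp <- utf8ToInt(x)
label <- sprintf("U+%04X", cp)

# UTF-8 byte sequence of the same character (two bytes: c3 bc)
bytes <- charToRaw(enc2utf8(x))

# Code point of the character I used by mistake (U+00BC)
wrong <- utf8ToInt("\u00BC")
```

The last byte of the UTF-8 sequence happens to be "bc", which is presumably 
how U+00BC slipped into my escapes.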

I didn't find a recommended method of replacing non-ASCII characters with 
Unicode \uxxxx escape sequences and ended up using the Unix command line tool 
"iconv". However, the iconv version installed on my GNU/Linux machine 
(openSUSE 11.4) seems to be outdated and doesn't support the very useful 
"--unicode-subst" option yet. I installed "libiconv" from 
http://www.gnu.org/software/libiconv/, and now I can easily replace all 
non-ASCII characters in my UTF-8 encoded R files with:

  iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
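
Where a recent libiconv is not at hand, the same replacement could be 
sketched in base R. This is an illustration under assumptions, not a method 
recommended by "Writing R Extensions": the helper name escape_non_ascii is 
made up, the input is assumed to be valid UTF-8 without NA values, and code 
points above U+FFFF are written as \UXXXXXXXX because \u takes at most four 
hex digits.

```r
# Hypothetical helper: replace every non-ASCII character in a character
# vector by a \uXXXX (or \UXXXXXXXX) escape sequence.
escape_non_ascii <- function(x) {
  escape_one <- function(s) {
    cp <- utf8ToInt(enc2utf8(s))           # Unicode code points
    ch <- intToUtf8(cp, multiple = TRUE)   # one string per character
    esc <- ifelse(cp > 0xFFFF,
                  sprintf("\\U%08X", cp),  # outside the BMP
                  sprintf("\\u%04X", cp))
    # Keep ASCII characters as they are, escape everything else
    paste(ifelse(cp < 128L, ch, esc), collapse = "")
  }
  vapply(x, escape_one, character(1), USE.NAMES = FALSE)
}

escaped <- escape_non_ascii("Z\u00FCrich")
writeLines(escaped)  # prints Z\u00FCrich
```

A whole file could then be processed line by line, e.g. with 
writeLines(escape_non_ascii(readLines("my-utf-8-encoded-file.R", 
encoding = "UTF-8"))). Like the iconv call, this escapes non-ASCII 
characters everywhere, including in comments.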

Thomas Zumbrunn


