[Rd] use of UTF-8 \uxxxx escape sequences in function arguments

Fri Jan 20 13:52:36 CET 2012

On Friday 20 January 2012, Simon Urbanek wrote:
> On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:
> > On Thursday 19 January 2012, peter dalgaard wrote:
> >> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
> >>>  plain("Zürich")  ## works
> >>>  plain("Z\u00BCrich")  ## fails
> >>>  escaped("Zürich")  ## fails
> >>>  escaped("Z\u00BCrich")  ## works
> >> 
> >> Using the correct UTF-8 code helps quite a bit:
> >> 
> >> U+00BC	¼	c2 bc	VULGAR FRACTION ONE QUARTER
> >> U+00FC	ü	c3 bc	LATIN SMALL LETTER U WITH DIAERESIS
> > 
> > Thank you for pointing that out. How embarrassing - I systematically used
> > the wrong representations. Even worse, I didn't carefully read "Writing
> > R Extensions" which speaks of "Unicode as \uxxxx escapes" rather than
> > "UTF-8 as \uxxxx escapes", so e.g. looking up the UTF-16 byte
> > representations would have done the trick.
> > 
> > I didn't find a recommended method of replacing non-ASCII characters with
> > Unicode \uxxxx escape sequences and ended up using the Unix command line
> > tool "iconv". However, the iconv version installed on my GNU/Linux
> > machine (openSUSE 11.4) seems to be outdated and doesn't support the
> > very useful "--unicode-subst" option yet. I installed "libiconv" from
> > http://www.gnu.org/software/libiconv/, and now I can easily replace all
> > non-ASCII characters in my UTF-8 encoded R files with:
> >  iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
> 
> You can actually do that with R alone:
> 
> ## you'll have to make sure that you're in C locale so R does the conversion for you
> Sys.setlocale(,"C")
> 
> utf8conv <- function(conn)
>   gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1",
>        capture.output(writeLines(readLines(conn, encoding = "UTF-8"))))
> 
> > writeLines(utf8conv("test.txt"))
> 
> M\u00F6gliche L\u00F6sung
> ne nebezpe\u010Dn\u00E9
> 
> Cheers,
> Simon

Thanks for the above function (which I wouldn't have managed to construct,
ever...). Maybe this is worth mentioning in the "Writing R Extensions" manual
(next to where the \uxxxx Unicode escape sequences are mentioned).

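For the archives, here is a minimal sketch of the same idea that does not
depend on switching to the C locale. It works from the Unicode code points
(which is what the \uxxxx escapes expect), not from the UTF-8 bytes that I
mistakenly looked up. The helper name escape_non_ascii is my own invention
and not part of R; it assumes the input contains no NA values, and it writes
characters beyond U+FFFF as \UXXXXXXXX.

```r
# Code point versus UTF-8 bytes for "ü" (U+00FC):
utf8ToInt("\u00FC")             # 252, i.e. 0xFC, the value used in \u00FC
charToRaw(enc2utf8("\u00FC"))   # c3 bc, the UTF-8 byte sequence

# Hypothetical helper (not an R built-in): replace every non-ASCII
# character by a \uXXXX (or \UXXXXXXXX) escape built from its code point.
escape_non_ascii <- function(x) {
  x <- enc2utf8(x)
  vapply(x, function(s) {
    cp <- utf8ToInt(s)                          # integer code points
    chars <- intToUtf8(cp, multiple = TRUE)     # one string per character
    out <- ifelse(cp < 128L,
                  chars,                        # keep ASCII unchanged
                  ifelse(cp <= 0xFFFF,
                         sprintf("\\u%04X", cp),
                         sprintf("\\U%08X", cp)))
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

writeLines(escape_non_ascii("Z\u00FCrich"))     # prints Z\u00FCrich
```

Applied to a file, something like
writeLines(escape_non_ascii(readLines("test.txt", encoding = "UTF-8")))
should give the same kind of output as Simon's function, assuming the file
really is UTF-8 encoded.
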
Thomas