[Rd] \U with more than 4 digits returns the wrong character

Duncan Murdoch murdoch.duncan at gmail.com
Thu Dec 4 22:21:47 CET 2014


On 04/12/2014, 2:00 PM, Richard Cotton wrote:
> If I type a character using \U syntax that has more than 4 digits, I
> get the wrong character.  For example,
> 
> "\U1d4d0"
> 
> should print a mathematical bold script capital A.  See
> http://www.fileformat.info/info/unicode/char/1d4d0/index.htm
> 
> On my machine, it prints the Hangul character corresponding to
> 
> "\Ud4d0"
> http://www.fileformat.info/info/unicode/char/d4d0/index.htm
> 
> It seems that the hex-digit part is overflowing at 16^4.
> 
> I tested this on R 3.1.2 and devel (2014-12-03 r67101) x64 under
> Windows.  I played around with Sys.setlocale and options("encoding"),
> but couldn't get the expected value.
> 
> Can others reproduce this?  It feels like a bug, but experience tells
> me I probably have something silly going on with my setup.
> 

The issue is that on Windows, the wchar_t in our C code is a 16-bit
value.  In the old days, Windows only supported 16-bit characters.
Since Windows 2000 they've supported the full Unicode range (code points
up to U+10FFFF, which need 21 bits) using the UTF-16 encoding, but our
internal code is still assuming a 16-bit limit.
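
To illustrate the arithmetic, here is a minimal C sketch.  It is not
taken from the R sources; the function names truncate_to_16 and
to_surrogates are made up for this example.  The first function shows
what happens when a code point is squeezed into a 16-bit unit, and the
second shows the two-unit surrogate pair that UTF-16 uses for anything
above U+FFFF.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 16 bits of a code point,
   which is what storing it in a single 16-bit wchar_t amounts to. */
static uint16_t truncate_to_16(uint32_t cp) {
    return (uint16_t)(cp & 0xFFFFu);
}

/* Hypothetical helper: encode a code point in the range
   U+10000..U+10FFFF as a UTF-16 surrogate pair.
   The caller must ensure cp is in that range. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo) {
    uint32_t v = cp - 0x10000u;                /* 20-bit offset */
    *hi = (uint16_t)(0xD800u + (v >> 10));     /* top 10 bits */
    *lo = (uint16_t)(0xDC00u + (v & 0x3FFu));  /* bottom 10 bits */
}
```

For the code point in the report, truncate_to_16(0x1D4D0) gives 0xD4D0,
the Hangul character Richard saw, whereas to_surrogates(0x1D4D0, ...)
gives the pair 0xD835, 0xDCD0.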

I'll submit the bug report on this and hopefully will get to it before
the next release, but it's tricky to catch all the possible places where
the upper bits get lost, given the dumb Windows convention that wchar_t
is 16 bits.

Duncan Murdoch


