[Rd] Native characterset is wrong for unicode builds for Windows

Duncan Murdoch murdoch.duncan at gmail.com
Fri Feb 27 03:13:48 CET 2015

On 26/02/2015 6:34 PM, maillist at tlink.de wrote:
>> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>>> When I send some outlandish characters through enc2native (or format) in
>>> R 3.1.2 on Ubuntu trusty it works quite well:
>>>   > "®ØΔЊת"
>>> [1] "®ØΔЊת"
>>>   > enc2native("®ØΔЊת")
>>> [1] "®ØΔЊת"
>>>   > Encoding(enc2native("®ØΔЊת"))
>>> [1] "UTF-8"
>>> In Windows the result is different:
>>>   > "®ØΔЊת"
>>> [1] "®ØΔЊת"
>>>   > enc2native("®ØΔЊת")
>>> [1] "®Ø<U+0394><U+040A><U+05EA>"
>>>   > Encoding(enc2native("®ØΔЊת"))
>>> [1] "latin1"
>>> And this is wrong. The native character set of a unicode application
>>> under Windows is *Unicode*. enc2native should do the same under Windows
>>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>>> mean the same as "UTF-8" exactly as it is on Linux.
>> What is a "unicode application", and what makes you think R is one?  R
>> is being told by Windows that your native encoding is latin1.  Perhaps
>> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
>> previous versions of Windows didn't.
>> Duncan Murdoch
> A unicode application is a program that uses the unicode API of Windows 

R uses those functions, so I guess it is a "unicode application".  But
internally it uses an 8 bit encoding (normally the native one for the
platform it is running on, which in your case is apparently latin1).
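A minimal sketch of how to check this from the R prompt. It assumes nothing beyond base R, and it builds the string from \u escapes so the result does not depend on the encoding of the file or console it is typed into. Apart from what the sessions quoted above report, the results of enc2native() are left unstated because they depend on the locale.

```r
# What the platform reports to R about the native encoding.
info <- l10n_info()
info$`UTF-8`      # TRUE in a UTF-8 locale, FALSE in a latin1 locale
info$`Latin-1`    # TRUE when the native encoding is latin1

# The five characters from the report, written as Unicode escapes.
x <- "\u00ae\u00d8\u0394\u040a\u05ea"
Encoding(x)             # strings built from \u escapes are marked "UTF-8"
nchar(x, type = "chars")  # 5 characters
nchar(x, type = "bytes")  # 10 bytes, two per character in UTF-8

# enc2native() converts to whatever 8-bit encoding the session uses.
# In the sessions quoted above the result is marked "UTF-8" on Ubuntu
# and "latin1" on Windows.
native <- enc2native(x)
Encoding(native)
```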

> - the functions with the ending W. For such an application the system 
> code page (native encoding) is completely irrelevant. The system code 
> page is just a compatibility feature that enables Windows NT/Vista/7/8 
> to run applications that were developed for Windows 95 which didn't have 
> unicode support. 

Windows 95 had UCS-2 support, which was pretty close to UTF-16.

> But this line of operating systems has been dead for 10 years
> now. R obviously is a unicode application because it can print - or read 
> from the clipboard - characters like "ΔЊת" that are not in my system 
> code page which is not possible over the legacy API.

So "unicode application" is something you just made up.

If you use Windows development tools, they have macros to convert
generic functions to either A or W versions.  R doesn't use those.  It
calls the W functions when it has UTF-16 characters, and A functions
when it has native characters.  I would love it if R was a UTF-8
application, because it would make life so much simpler, but Windows
doesn't support that.  So R needs to do tons of conversions.  If you
don't like that, you probably need to stick with Ubuntu.

Duncan Murdoch

> Neither the unicode API nor the legacy API accepts UTF-8. The legacy API 
> needs strings encoded according to the active code page and the unicode 
> API wants UTF-16. If you have UTF-8 you need to convert it either to 
> the active code page, which will lose all characters that are not 
> covered by it, or convert it to UTF-16 and use the unicode functions. But 
> this is not the problem, the Windows interface functions of R are 
> working quite nicely with unicode already.
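The two conversion routes described in the quoted paragraph can be sketched with base R's iconv(). This is only an illustration under stated assumptions: the active code page is taken to be latin1, as in the Windows session above, and "?" is an arbitrary substitution marker. The exact result of the lossy conversion can vary between iconv implementations, so it is not shown.

```r
# The five characters from the report, written as Unicode escapes.
x <- "\u00ae\u00d8\u0394\u040a\u05ea"

# Route 1: convert to the active code page (latin1 assumed here).
# Bytes that cannot be represented are replaced by the sub marker,
# so the characters outside latin1 do not survive.
to_cp <- iconv(x, from = "UTF-8", to = "latin1", sub = "?")

# Route 2: convert to UTF-16 (little-endian, returned as raw bytes).
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
length(utf16)   # 10 bytes: each of these characters takes one 16-bit unit

# Converting back recovers the original bytes, so nothing was lost.
back <- iconv(list(utf16), from = "UTF-16LE", to = "UTF-8")
identical(charToRaw(back), charToRaw(x))   # TRUE
```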
