[Rd] Native characterset is wrong for unicode builds for Windows

maillist at tlink.de maillist at tlink.de
Fri Feb 27 00:34:03 CET 2015

> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>> When I send some outlandish characters through enc2native (or format) in
>> R 3.1.2 on Ubuntu trusty it works quite well:
>>   > "®ØΔЊת"
>> [1] "®ØΔЊת"
>>   > enc2native("®ØΔЊת")
>> [1] "®ØΔЊת"
>>   > Encoding(enc2native("®ØΔЊת"))
>> [1] "UTF-8"
>> In Windows the result is different:
>>   > "®ØΔЊת"
>> [1] "®ØΔЊת"
>>   > enc2native("®ØΔЊת")
>> [1] "®Ø<U+0394><U+040A><U+05EA>"
>>   > Encoding(enc2native("®ØΔЊת"))
>> [1] "latin1"
>> And this is wrong. The native character set of a unicode application
>> under Windows is *Unicode*. enc2native should do the same under Windows
>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>> mean the same as "UTF-8" exactly as it is on Linux.
> What is a "unicode application", and what makes you think R is one?  R
> is being told by Windows that your native encoding is latin1.  Perhaps
> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
> previous versions of Windows didn't.
> Duncan Murdoch
A Unicode application is a program that uses the Unicode API of Windows,
i.e. the functions whose names end in W. For such an application the
system code page (native encoding) is completely irrelevant. The system
code page is just a compatibility feature that enables Windows
NT/Vista/7/8 to run applications that were developed for Windows 95,
which didn't have Unicode support. But that line of operating systems
has been dead for 10 years now. R obviously is a Unicode application
because it can print, or read from the clipboard, characters like "ΔЊת"
that are not in my system code page, which is not possible through the
legacy API.

Neither the Unicode API nor the legacy API accepts UTF-8. The legacy API
needs strings encoded according to the active code page, and the Unicode
API wants UTF-16. If you have UTF-8 you need to convert it either to the
active code page, which will lose all characters that are not covered by
it, or to UTF-16 and use the Unicode functions. But this is not the
problem: the Windows interface functions of R are already working quite
nicely with Unicode.
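
To make the two conversion routes concrete, here is a minimal sketch in
Python (chosen only because its codecs behave the same on every
platform; this is not how R does it internally). It assumes cp1252 as
the active code page, which is just an example, and it only imitates the
conversions with Python's own codecs rather than calling any Windows
function.

```python
# The test string from the examples above, held as UTF-8 bytes.
text = "®ØΔЊת"
utf8_bytes = text.encode("utf-8")

# Route 1, legacy ("A") API: convert to the active code page.
# cp1252 is an assumed example code page; characters it does not
# cover are replaced with "?" and cannot be recovered afterwards.
legacy_bytes = utf8_bytes.decode("utf-8").encode("cp1252", errors="replace")
legacy_roundtrip = legacy_bytes.decode("cp1252")

# Route 2, Unicode ("W") API: convert to UTF-16 (little-endian, as
# Windows uses). Every character survives the round trip.
wide_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
wide_roundtrip = wide_bytes.decode("utf-16-le")

print(legacy_roundtrip == text)  # False: the code page route is lossy
print(wide_roundtrip == text)    # True: the UTF-16 route is lossless
```

With this assumed code page, only "®" and "Ø" survive the first route,
which matches the latin1 result that enc2native gives on Windows above.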
