[Rd] Native characterset is wrong for unicode builds for Windows

Fri Feb 27 21:01:47 CET 2015

On 27.02.2015 at 11:49, Duncan Murdoch wrote:
> On 27/02/2015 2:31 AM, maillist at tlink.de wrote:
>> On 27.02.2015 at 03:13, Duncan Murdoch wrote:
>>> On 26/02/2015 6:34 PM, maillist at tlink.de wrote:
>>>>> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>>>>>> When I send some outlandish characters through enc2native (or format) in
>>>>>> R 3.1.2 on Ubuntu trusty it works quite well:
>>>>>>
>>>>>>     > "®ØΔЊת"
>>>>>> [1] "®ØΔЊת"
>>>>>>     > enc2native("®ØΔЊת")
>>>>>> [1] "®ØΔЊת"
>>>>>>     > Encoding(enc2native("®ØΔЊת"))
>>>>>> [1] "UTF-8"
>>>>>>
>>>>>> In Windows the result is different:
>>>>>>
>>>>>>     > "®ØΔЊת"
>>>>>> [1] "®ØΔЊת"
>>>>>>     > enc2native("®ØΔЊת")
>>>>>> [1] "®Ø<U+0394><U+040A><U+05EA>"
>>>>>>     > Encoding(enc2native("®ØΔЊת"))
>>>>>> [1] "latin1"
>>>>>>
>>>>>> And this is wrong. The native character set of a unicode application
>>>>>> under Windows is *Unicode*. enc2native should do the same under Windows
>>>>>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>>>>>> mean the same as "UTF-8" exactly as it is on Linux.
>>>>> What is a "unicode application", and what makes you think R is one?  R
>>>>> is being told by Windows that your native encoding is latin1.  Perhaps
>>>>> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
>>>>> previous versions of Windows didn't.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>> A unicode application is a program that uses the unicode API of Windows
>>> R uses those functions, so I guess it is a "unicode application".  But
>>> internally it uses an 8 bit encoding (normally the native one for the
>>> platform it is running on, which in your case is apparently latin1).
>>>
>>>> - the functions with the ending W. For such an application the system
>>>> code page (native encoding) is completely irrelevant. The system code
>>>> page is just a compatibility feature that enables Windows NT/Vista/7/8
>>>> to run applications that were developed for Windows 95 which didn't have
>>>> unicode support.
>>> Windows 95 had UCS-2 support, which was pretty close to UTF-16.
>>>
>>>> But this line of operating systems has been dead for 10 years
>>>> now. R obviously is a unicode application because it can print - or read
>>>> from the clipboard - characters like "ΔЊת" that are not in my system
>>>> code page which is not possible over the legacy API.
>>> So "unicode application" is something you just made up.
>>>
>>> If you use Windows development tools, they have macros to convert
>>> generic functions to either A or W versions.  R doesn't use those.  It
>>> calls the W functions when it has UTF-16 characters, and A functions
>>> when it has native characters.  I would love it if R was a UTF-8
>>> application, because it would make life so much simpler, but Windows
>>> doesn't support that.  So R needs to do tons of conversions.  If you
>>> don't like that, you probably need to stick with Ubuntu.
>>>
>>> Duncan Murdoch
>>>
>> I am not complaining about those conversions. They work just fine
>> already. I am complaining about enc2native breaking things in the
>> Windows builds. An assignment like
>>
>> s <- format("®ØΔЊת")
>>
>> has no interaction with Windows at all, yet "s" contains garbage like
>> "®Ø<U+0394><U+040A><U+05EA>" after that. And if a native encoding of
>> UTF-8 - as defined by enc2native - works in Ubuntu, why shouldn't it
>> work in Windows?
> Because in Ubuntu, UTF-8 is the native encoding, and in your Windows
> system, latin1 is the native encoding.
>
> But I do agree that the format() issue is a problem.  I haven't traced
> through the code, but I think the string "®ØΔЊת" is read using Windows
> API functions that return a UTF-16 result, then converted by R to UTF-8.
>   So format() should see that it is a UTF-8 string and not convert it to
> the native encoding.  There is nothing wrong with enc2native(), it's
> doing what you asked for.  The problem is that format() is using it.
>
> Duncan Murdoch

I would expect every function that uses enc2native to be broken in this
respect, because it will invariably scramble most Unicode characters in
the process, and I can't think of a case where that would actually be
wanted.
Functions that really need something other than UTF-8 are probably using
iconv and getOption("encoding") anyway, as this allows the desired
encoding to be specified much more flexibly.
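
To make the distinction above concrete, here is a minimal R sketch, not taken from the thread. It checks whether a string survives a round trip through the session's native encoding, and contrasts that with an explicit iconv() call where the target encoding is named directly. The helper name survives_native and the variable names are hypothetical. The string is built from code points so that the result does not depend on how the script file itself is encoded. What the round trip returns depends on the locale of the session that runs it.

```r
# Build the test string from code points, so the result does not depend on
# how this source file itself is encoded.
s <- intToUtf8(c(0x00AE, 0x00D8, 0x0394, 0x040A, 0x05EA))  # "®ØΔЊת"

# Hypothetical helper (not part of R): TRUE if x comes back unchanged
# after a round trip through the native encoding of the current session.
survives_native <- function(x) {
  identical(enc2utf8(enc2native(x)), enc2utf8(x))
}

Encoding(s)          # the declared encoding mark of s
survives_native(s)   # result depends on the session's locale

# Explicit conversion with iconv(): the target encoding is named directly,
# and `sub` controls what replaces bytes that cannot be converted.
latin1_copy <- iconv(s, from = "UTF-8", to = "latin1", sub = "?")

# enc2utf8() keeps the string in UTF-8, so the code points are preserved.
utf8_copy <- enc2utf8(s)
```

A function that wants to avoid the lossy step could run a check like survives_native() before calling enc2native(), and keep the UTF-8 string when the check fails. Whether format() should do so is the open question in this thread.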