[Rd] Native characterset is wrong for unicode builds for Windows

maillist at tlink.de maillist at tlink.de
Fri Feb 27 00:55:25 CET 2015


On 26.02.2015 at 23:44, Winston Chang wrote:
> On Thu, Feb 26, 2015 at 2:09 PM, maillist at tlink.de wrote:
>
>
>     When I send some outlandish characters through enc2native (or
>     format) in R 3.1.2 on Ubuntu trusty it works quite well:
>
>     > "®ØΔЊת"
>     [1] "®ØΔЊת"
>     > enc2native("®ØΔЊת")
>     [1] "®ØΔЊת"
>     > Encoding(enc2native("®ØΔЊת"))
>     [1] "UTF-8"
>
>     In Windows the result is different:
>
>     > "®ØΔЊת"
>     [1] "®ØΔЊת"
>     > enc2native("®ØΔЊת")
>     [1] "®Ø<U+0394><U+040A><U+05EA>"
>     > Encoding(enc2native("®ØΔЊת"))
>     [1] "latin1"
>
>     And this is wrong. The native character set of a unicode
>     application under Windows is *Unicode*. enc2native should do the
>     same under Windows as it does on Ubuntu. Also the "unknown"
>     encoding should be changed to mean the same as "UTF-8" exactly as
>     it is on Linux.
>
>
> I think you're mixing up the term "character set" with the encoding 
> for a character set. Unicode is a character set. UTF-8 is one of many 
> encodings of Unicode.
>
> UTF-8 may be the native character encoding in Ubuntu, but it's not the 
> native encoding in Windows. According to this, what counts as the 
> native encoding in Windows depends on the code page:
> http://stackoverflow.com/a/4649507
>
> So you shouldn't expect enc2native to do the same thing on Linux and 
> Windows. If you really want UTF-8, you can use enc2utf8.
>
> -Winston

Maybe I'm expecting too much, but I would rather R did not produce garbage 
like "®Ø<U+0394><U+040A><U+05EA>". And while I can use enc2utf8 to 
convert from UTF-8 to UTF-8, this does not fix the many places, such as 
"format", where enc2native is used and which are broken because of this.
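
For anyone who wants to reproduce the difference, here is a minimal sketch. 
It uses only base R; the variable names (x, info, u, n, lost) are just for 
illustration, and the string is built from code points so that the encoding 
of the script file does not matter. What enc2native returns depends on the 
locale of the session, so no output is shown for that part.

```r
# The five characters from the example above, built from their code points:
# U+00AE, U+00D8, U+0394, U+040A, U+05EA.
x <- intToUtf8(c(0x00AE, 0x00D8, 0x0394, 0x040A, 0x05EA))

# What R considers the native encoding of this session.
# On Windows the list also carries a 'codepage' element.
info <- l10n_info()
print(info)

# enc2utf8 keeps all five characters regardless of the locale.
u <- enc2utf8(x)
print(Encoding(u))
print(nchar(u, type = "chars"))

# enc2native converts to the locale encoding. Characters the locale
# cannot represent come back as <U+XXXX> escapes.
n <- enc2native(x)
print(Encoding(n))
print(n)

# TRUE if the round trip through the native encoding changed the string.
lost <- !identical(enc2utf8(n), x)
print(lost)
```

In a UTF-8 locale I would expect lost to be FALSE; in a session like the 
Windows one quoted above I would expect it to be TRUE.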


