[R] Mac-specific encoding bug

peter dalgaard pdalgd at gmail.com
Sun May 7 22:51:01 CEST 2017


> On 7 May 2017, at 08:36 , Oliver Keyes <ironholds at gmail.com> wrote:
> 
> Hey all,
> 
> I've run into a weird quirk on Mac platforms, which you can read fully
> at https://github.com/Ironholds/urltools/issues/70
> 
> The long and the short of it is that one specific codepoint - \u04cf -
> does not print in a UTF-8-y way by default, except when run through
> cat(). Compare, for example:
> 
> encodeString("\u04cf")
> 
> and:
> 
> encodeString("\u044D")
> 
> Kevin Ushey was kind enough to bring his expertise, and found that it
> may be a locale-specific problem as well as a Mac-specific problem,
> because 'sourcetools' shows that there's no locale information for the
> character. But this only appears in R - Python has it display
> perfectly - so I'm kind of at a loss. Does anyone know what's going
> on?

Python being less careful than R?

Basically, characters get escaped on printing if they are not known to be printable, and "Cyrillic Small Letter Palochka" (U+04CF) is, it seems, not recorded as printable in the common UTF-8 locales. cat() writes the string as is, which is why it comes out fine there. From what I can google, the letter is used in Chechen, and even then only as a postfix to certain characters.
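
To make the "escape unless known printable" rule concrete, here is a rough sketch in Python. It is not R's actual implementation: encode_string() and incomplete_table() are made-up names, and incomplete_table() is only a hypothetical stand-in for a locale table that lacks an entry for U+04CF. Python's own str.isprintable() consults Python's bundled Unicode database rather than the system locale, which would be consistent with the character displaying fine there.

```python
import unicodedata


def encode_string(s, is_printable=str.isprintable):
    """Keep characters the predicate accepts; escape all others."""
    out = []
    for ch in s:
        if is_printable(ch):
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            out.append("\\u{:04x}".format(ord(ch)))
        else:
            out.append("\\U{:08x}".format(ord(ch)))
    return "".join(out)


def incomplete_table(ch):
    # Hypothetical predicate: behaves like a printable-character table
    # that has no entry for U+04CF (an assumption for illustration).
    return ch.isprintable() and ch != "\u04cf"


palochka = "\u04cf"
print(unicodedata.name(palochka))
print(encode_string(palochka))                    # Python's own database
print(encode_string(palochka, incomplete_table))  # table missing U+04CF
print(encode_string("\u044d", incomplete_table))  # unaffected character
```

With the incomplete table the palochka comes back as an escape sequence while \u044d passes through unchanged, which is the same pattern as the two encodeString() calls in the original report.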

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
