[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Fri Feb 8 17:31:52 CET 2019

On 08/02/2019 11:12 a.m., Daniel Possenriede wrote:
> Tomas,
> 
>> In my scenario, the conversion is invoked by RGui before returning the
>> input to the main R loop, even before the input gets to the parser. In
>> principle, we could change this particular conversion in RGui to avoid the
>> substitution.
> 
> Not sure whether I am missing something here, but I used RStudio for my
> examples (I should have said) and David mentioned RStudio as well, so it
> does not seem to be a problem with RGui only.
> 
> Another example for the "best fit" behaviour seems to be "Σ"
> ("\u03A3", greek capital letter sigma, not "\u2211", n-ary summation):
> 
> print("Σ")
> #> [1] "S"
> 
> Again with cp1252 on Windows 10, R 3.5.2, RStudio 1.2.1256 preview.
> 
>> even though we could rewrite in principle all calls to Windows API to use
>> Unicode and have all strings in UTF-8 in R, we would still have problems
>> when interfacing with packages that assume strings are in current native
>> encoding (without checking), so this problem won't be easy to fix.
> 
> Since I regularly encounter the reverse problem, i.e. packages that assume
> strings are in UTF-8 encoding without checking (which isn't very
> surprising, assuming that most package developers develop on Unix/macOS
> systems), I'd say, "rip off the band-aid sooner rather than later". Obviously
> I don't know how many bugs would surface in packages if R for Windows'
> native encoding were to switch to UTF-8, but these bugs would only be
> transitory, I suppose. Whereas there is a steady inflow of
> assume-UTF-8-encoding-bugs in new packages and functions with the current
> situation.

Just one minor comment:  it is *impossible* for R for Windows "native" 
encoding to switch to UTF-8, since Windows doesn't support that.  The 
necessary change (which I'd support, but it's a really large amount of 
work) would be for R to drop its use of native encodings internally. 
Convert everything to UTF-8 on the way in, convert to native on the way out.
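
A minimal sketch of that boundary discipline at the R level. The helper names `to_internal()` and `to_native()` are hypothetical and only for illustration; the base R functions `enc2utf8()` and `enc2native()` do the actual conversions.

```r
# Hypothetical helper names; only enc2utf8() and enc2native() are base R.
to_internal <- function(x) enc2utf8(x)    # on the way in: always UTF-8
to_native   <- function(x) enc2native(x)  # on the way out: current native encoding

y <- "\u221e"      # infinity, written as an escape so the source stays ASCII
z <- to_internal(y)
charToRaw(z)       # the three UTF-8 bytes: e2 88 9e

# to_native(z) is the step where a non-UTF-8 locale such as cp1252
# may substitute or escape the character.
```

The point of the sketch is only that the lossy step is confined to the output boundary; everything in between would see UTF-8.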

This is a large amount of work because R has preferred native encodings 
basically forever, so there are tons of locations needing changes, and a 
large effort would be required to make them.  It would likely be easier 
for Windows to add UTF-8 as a native encoding.  Converting between that 
and Windows internal UTF-16 is nearly trivial, much easier than many of 
the conversions it does.  And Microsoft has revenues of $90 billion per 
year, while R Core only has a few individuals donating their time:  so 
wouldn't it make more sense to ask them to act like responsible members 
of the computing community?
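
On the point that converting between UTF-8 and UTF-16 is nearly trivial, here is a small sketch using base R's `iconv()`. It assumes the platform's iconv recognises the encoding name "UTF-16LE".

```r
x <- "\u221e"

# UTF-8 -> UTF-16LE: the infinity sign becomes a single 16-bit code unit.
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
utf16

# UTF-16LE -> UTF-8: iconv() accepts a list of raw vectors as input.
back <- iconv(list(utf16), from = "UTF-16LE", to = "UTF-8")
identical(charToRaw(back), charToRaw(x))
```

Neither direction can fail for valid input, which is what distinguishes it from the conversions to a legacy code page discussed in this thread.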

Duncan Murdoch

> 
> Best,
> Daniel
> 
> 
> On Fri, 8 Feb 2019 at 13:07, Tomas Kalibera <
> tomas.kalibera using gmail.com> wrote:
> 
>> I can reproduce this behavior on my Windows 10 system in RGui (cp1252):
>> when I paste the Unicode infinity symbol into the console, it is treated
>> as number 8. This is caused by Windows "best fit" default behavior in
>> conversion of unicode characters to characters in the current native
>> encoding: at some point in the past, 8 has been chosen as a good fit for
>> infinity in Windows. In my scenario, the conversion is invoked by RGui
>> before returning the input to the main R loop, even before the input
>> gets to the parser. In principle, we could change this particular
>> conversion in RGui to avoid the substitution. RGui uses "\uxxxx" escapes
>> to pass characters that cannot be represented, this is why e.g. the
>> Cyrillic Zhe \u0436 worked, so we could tell Windows not to do the
>> substitution and pass "\u221e" for Infinity, and then the string after
>> being processed by the parser will be represented in UTF-8 inside R and
>> could be e.g. printed by the RGui console. That is something that could
>> be considered, but it will not solve the main problem and it may
>> actually cause trouble to users who are used to such substitutions
>> (especially when the substitutions are more intuitive, but, that may be
>> a matter of opinion).
>>
>> The main problem is that in normal use, sooner or later R will get to
>> the point when it will need to do the conversion to native encoding, and
>> in some context where "\uxxxx" escapes will not be possible. One cannot
>> reliably work with strings in R that cannot be represented in the
>> current native encoding (except when one knows precisely how to avoid
>> the conversion in some specific task, but that may be brittle; so the
>> best-fit substitution might in principle help here). This problem does
>> not exist on Unix/macOS systems where the current native encoding is
>> UTF-8 these days, so today it only exists on Windows where UTF-8 cannot
>> be the current native encoding. As has been discussed before, even
>> though we could rewrite in principle all calls to Windows API to use
>> Unicode and have all strings in UTF-8 in R, we would still have problems
>> when interfacing with packages that assume strings are in current native
>> encoding (without checking), so this problem won't be easy to fix.
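
One way to check in advance whether a string survives conversion to a given native encoding is sketched below. The helper `is_representable()` is a hypothetical name, and the check relies on `iconv()` returning NA for input it cannot convert (its default with `sub = NA`). Some iconv implementations substitute a replacement character instead of failing, so treat the result as a hint rather than a guarantee.

```r
# Hypothetical helper: TRUE where iconv() reports a successful conversion
# of UTF-8 input to `encoding`.
is_representable <- function(x, encoding = "CP1252") {
  !is.na(iconv(x, from = "UTF-8", to = encoding))
}

# "8" and e-acute exist in CP1252; the infinity sign does not.
is_representable(c("8", "\u00e9", "\u221e"))
```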
>>
>> Best,
>> Tomas
>>
>> On 2/7/19 3:10 PM, Daniel Possenriede wrote:
>>> There seems to be something odd with "∞" on Windows (and not only with
>>> read.table).
>>> In native encoding (cp1252 in my case), "∞" gets converted to "8":
>>>
>>> x <-  "∞"
>>> Encoding(x)
>>> #> [1] "unknown"
>>> print(x)
>>> #> [1] "8"
>>> charToRaw(x)
>>> #> [1] 38
>>>
>>> "∞" is indeed "8"
>>>
>>> identical(x, "8")
>>> #> [1] TRUE
>>>
>>> Everything seems fine if  "∞" is UTF-8 encoded.
>>>
>>> y <- "\u221E"
>>> Encoding(y)
>>> #> [1] "UTF-8"
>>> print(y)
>>> #> [1] "∞"
>>> charToRaw(y)
>>> #> [1] e2 88 9e
>>>
>>> Unless the string is converted back to native encoding.
>>>
>>> format(y)
>>> #> [1] "8"
>>>
>>> This ought to be "<U+221E>", analogous to
>>>
>>> format("∝")
>>> #> [1] "<U+221D>"
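
To tell the two strings apart without depending on how the console renders them, the code points can be inspected directly. This is a small sketch using base R's `utf8ToInt()`; the variable names mirror the example above.

```r
x <- "8"        # what the best-fit substitution produces
y <- "\u221e"   # the intended infinity sign

sprintf("U+%04X", utf8ToInt(x))   # "U+0038"
sprintf("U+%04X", utf8ToInt(y))   # "U+221E"
```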
>>>
>>> Session Info:
>>>
>>> si <- sessionInfo()
>>> si$running
>>> #> [1] "Windows 10 x64 (build 17134)"
>>> si$R.version$version.string
>>> #> [1] "R version 3.5.2 (2018-12-20)"
>>> si$locale
>>> #> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>>>
>>>
>>>
>>> On Thu, 7 Feb 2019 at 14:33, David Byrne <
>>> david.byrne222 using gmail.com> wrote:
>>>
>>>> I can confirm that it doesn't happen on Ubuntu 18.04.1 so Peter is
>>>> most likely correct; it looks like it's Windows-specific.
>>>>
>>>> On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pdalgd using gmail.com> wrote:
>>>>> This doesn't seem to be happening on MacOS, neither in Terminal nor
>>>>> RStudio (R 3.5.1, R-devel, R-patched). So probably Windows-specific.
>>>>> -pd
>>>>>
>>>>>> On 7 Feb 2019, at 11:17, David Byrne <david.byrne222 using gmail.com> wrote:
>>>>>> Bug
>>>>>> Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
>>>>>> file containing the infinity symbol (' ∞ ') results in the infinity
>>>>>> symbol imported as the number 8. Other Unicode characters seem
>>>>>> unaffected, for example, Zhe: ж
>>>>>>
>>>>>> Expected Behavior:
>>>>>> The imported data.frame should represent the infinity symbol as the
>>>>>> expected 'Inf' so that normal mathematical operations can be processed
>>>>>>
>>>>>> Stack Overflow Post:
>>>>>> I created a question on Stack Overflow where one other member was able
>>>>>> to reproduce the same issues I was having. This question can be found
>>>>>> at:
>>>>>>
>>>>>> https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
>>>>>> Method to Reproduce - 1:
>>>>>> A simple method to reproduce this issue is to use RStudio: In the
>>>>>> console, type the following:
>>>>>>> read.table(text=" ∞", encoding="UTF-8")
>>>>>> The result should be a data.frame with a single value of '8'
>>>>>>
>>>>>> Repeating the same with ж results in the correct, expected behavior.
>>>>>>
>>>>>> Method to Reproduce - 2:
>>>>>> Create a .csv file containing the infinity and Zhe characters (I have
>>>>>> attached the file for convenience; hopefully it is not rejected by your
>>>>>> email service). Launch an interactive session using
>>>>>>
>>>>>>> r --vanilla
>>>>>> Enter the following statement taking care to replace the
>>>>>> <path-to-file> with the appropriate one:
>>>>>>
>>>>>>> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
>>>>>>
>>>>>> This should result in a two element data.frame; the first being the
>>>>>> incorrect value of 8 with an additional <U+FEFF> and the second the
>>>>>> correct value of Zhe.
>>>>>>
>>>>>> Note the additional <U+FEFF> prefixed to the front of the '8'. This
>>>>>> appears to be a byte order mark (BOM), a hidden character that tells
>>>>>> editors the encoding. The following link has some explanation; however,
>>>>>> it states this is caused by Excel. The file I created was made with
>>>>>> Notepad, not Excel.
>>>>>>
>>>>>>
>>>>>> https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
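
The U+FEFF prefix is a byte order mark at the start of the file. One way to have it stripped on input, sketched here under the assumption that the file really is UTF-8 with a BOM, is the `fileEncoding = "UTF-8-BOM"` value that R's connections accept. The temporary file below is made up for illustration and holds ASCII data only, so it does not touch the separate infinity-symbol problem.

```r
# Build a small CSV that starts with the UTF-8 byte order mark (EF BB BF).
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(as.raw(c(0xef, 0xbb, 0xbf)), con)
writeBin(charToRaw("1,2\n"), con)
close(con)

# "UTF-8-BOM" asks the connection to drop the BOM before parsing.
df <- read.table(path, sep = ",", fileEncoding = "UTF-8-BOM")
df$V1
unlink(path)
```

Note that `fileEncoding` re-encodes the input while reading, whereas `encoding` (as used in the report) only marks the resulting strings.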
>>>>>> System Details:
>>>>>> OS:
>>>>>>> Windows 10.0.17134 Build 17134
>>>>>>
>>>>>> R Version:
>>>>>>> platform       x86_64-w64-mingw32
>>>>>>> arch           x86_64
>>>>>>> os             mingw32
>>>>>>> system         x86_64, mingw32
>>>>>>> status
>>>>>>> major          3
>>>>>>> minor          4.1
>>>>>>> year           2017
>>>>>>> month          06
>>>>>>> day            30
>>>>>>> svn rev        72865
>>>>>>> language       R
>>>>>>> version.string R version 3.4.1 (2017-06-30)
>>>>>>> nickname       Single Candle
>>>>>> ______________________________________________
>>>>>> R-devel using r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>> --
>>>>> Peter Dalgaard, Professor,
>>>>> Center for Statistics, Copenhagen Business School
>>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>>> Phone: (+45)38153501
>>>>> Office: A 4.23
>>>>> Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
>>>>>