[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Daniel Possenriede possenriede using gmail.com
Fri Feb 8 17:12:06 CET 2019


Tomas,

> In my scenario, the conversion is invoked by RGui before returning the
> input to the main R loop, even before the input gets to the parser. In
> principle, we could change this particular conversion in RGui to avoid the
> substitution.

Not sure whether I am missing something here, but I used RStudio for my
examples (I should have said so) and David mentioned RStudio as well, so it
does not seem to be a problem with RGui only.

Another example of the "best fit" behaviour seems to be "Σ"
("\u03A3", Greek capital letter sigma, not "\u2211", n-ary summation):

print("Σ")
#> [1] "S"

Again with cp1252 on Windows 10, R 3.5.2, RStudio 1.2.1256 preview.
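
To check which character actually ended up in a string, one can look at its
code points. The following is only a minimal sketch; code_points() is a
hypothetical helper name, not an existing function, and it assumes the
string is written with a "\u" escape so that it stays in UTF-8 (as
"\u221E" did in my earlier mail):

```r
# Hypothetical helper (not from the thread): list the Unicode code points
# of a single string as "U+XXXX" labels, converting to UTF-8 first.
code_points <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  sprintf("U+%04X", utf8ToInt(enc2utf8(x)))
}

sigma <- "\u03A3"        # escape form, so the string is stored as UTF-8
code_points(sigma)       # "U+03A3"
code_points("S")         # "U+0053"
identical(sigma, "S")    # FALSE
```

If a pasted "Σ" reports U+0053, the substitution has already happened
before the string reached R.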

> even though we could rewrite in principle all calls to Windows API to use
> Unicode and have all strings in UTF-8 in R, we would still have problems
> when interfacing with packages that assume strings are in current native
> encoding (without checking), so this problem won't be easy to fix.

Since I regularly encounter the reverse problem, i.e. packages that assume
strings are in UTF-8 encoding without checking (which isn't very
surprising, assuming that most package developers develop on Unix/macOS
systems), I'd say: rip off the band-aid sooner rather than later. Obviously
I don't know how many bugs would surface in packages if R for Windows'
native encoding were to switch to UTF-8, but I suppose these bugs would
only be transitory, whereas with the current situation there is a steady
inflow of assume-UTF-8-encoding bugs in new packages and functions.
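
For illustration, a package function could convert and check its input
instead of assuming UTF-8. This is only a sketch under my own assumptions;
as_checked_utf8() is a made-up name, and enc2utf8() and validUTF8() are the
base R functions it relies on:

```r
# Hypothetical helper (name and behaviour are assumptions for illustration):
# bring character input to valid UTF-8 instead of assuming it already is.
as_checked_utf8 <- function(x) {
  stopifnot(is.character(x))
  x <- enc2utf8(x)                   # converts native- and latin1-marked strings
  bad <- !is.na(x) & !validUTF8(x)   # NA entries are left alone
  if (any(bad)) stop("input is not valid UTF-8 after conversion")
  x
}

# A latin1-marked "e with acute" (byte 0xE9) becomes the two-byte UTF-8 form.
e_acute <- rawToChar(as.raw(0xe9))
Encoding(e_acute) <- "latin1"
charToRaw(as_checked_utf8(e_acute))  # c3 a9
```

This does not help with characters that were already replaced by a best-fit
substitute, since that information is lost before the check runs.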

Best,
Daniel


On Fri, 8 Feb 2019 at 13:07, Tomas Kalibera <
tomas.kalibera using gmail.com> wrote:

> I can reproduce this behavior on my Windows 10 system in RGui (cp1252):
> when I paste the Unicode infinity symbol into the console, it is treated
> as number 8. This is caused by Windows "best fit" default behavior in
> conversion of unicode characters to characters in the current native
> encoding: at some point in the past, 8 has been chosen as a good fit for
> infinity in Windows. In my scenario, the conversion is invoked by RGui
> before returning the input to the main R loop, even before the input
> gets to the parser. In principle, we could change this particular
> conversion in RGui to avoid the substitution. RGui uses "\uxxxx" escapes
> to pass characters that cannot be represented, this is why e.g. the
> Cyrillic Zhe \u0436 worked, so we could tell Windows not to do the
> substitution and pass "\u221e" for Infinity, and then the string after
> being processed by the parser will be represented in UTF-8 inside R and
> could be e.g. printed by the RGui console. That is something that could
> be considered, but it will not solve the main problem and it may
> actually cause trouble to users who are used to such substitutions
> (especially when the substitutions are more intuitive, but, that may be
> a matter of opinion).
>
> The main problem is that in normal use, sooner or later R will get to
> the point when it will need to do the conversion to native encoding, and
> in some context where "\uxxxx" escapes will not be possible. One cannot
> reliably work with strings in R that cannot be represented in the
> current native encoding (except when one knows precisely how to avoid
> the conversion in some specific task, but that may be brittle; so the
> best-fit substitution might in principle help here). This problem does
> not exist on Unix/macOS systems where the current native encoding is
> UTF-8 these days, so today it only exists on Windows where UTF-8 cannot
> be the current native encoding. As has been discussed before, even
> though we could rewrite in principle all calls to Windows API to use
> Unicode and have all strings in UTF-8 in R, we would still have problems
> when interfacing with packages that assume strings are in current native
> encoding (without checking), so this problem won't be easy to fix.
>
> Best,
> Tomas
>
> On 2/7/19 3:10 PM, Daniel Possenriede wrote:
> > There seems to be something odd with "∞" on Windows (and not only with
> > read.table).
> > In native encoding (cp1252 in my case), "∞" gets converted to "8":
> >
> > x <-  "∞"
> > Encoding(x)
> > #> [1] "unknown"
> > print(x)
> > #> [1] "8"
> > charToRaw(x)
> > #> [1] 38
> >
> > "∞" is indeed "8"
> >
> > identical(x, "8")
> > #> [1] TRUE
> >
> > Everything seems fine if  "∞" is UTF-8 encoded.
> >
> > y <- "\u221E"
> > Encoding(y)
> > #> [1] "UTF-8"
> > print(y)
> > #> [1] "∞"
> > charToRaw(y)
> > #> [1] e2 88 9e
> >
> > Unless the string is converted back to native encoding.
> >
> > format(y)
> > #> [1] "8"
> >
> > This ought to be "<U+221E>", analogous to
> >
> > format("∝")
> > #> [1] "<U+221D>"
> >
> > Session Info:
> >
> > si <- sessionInfo()
> > si$running
> > #> [1] "Windows 10 x64 (build 17134)"
> > si$R.version$version.string
> > #> [1] "R version 3.5.2 (2018-12-20)"
> > si$locale
> > #> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> >
> >
> >
> > On Thu, 7 Feb 2019 at 14:33, David Byrne <
> > david.byrne222 using gmail.com> wrote:
> >
> >> I can confirm that it doesn't happen on Ubuntu 18.04.1 so Peter is
> >> most likely correct; it looks like it's Windows-specific.
> >>
> >> On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pdalgd using gmail.com> wrote:
> >>> This doesn't seem to be happening on MacOS, neither in Terminal nor
> >>> RStudio, (R 3.5.1, R-devel, R-patched). So probably Windows specific.
> >>> -pd
> >>>
> >>>> On 7 Feb 2019, at 11:17, David Byrne <david.byrne222 using gmail.com> wrote:
> >>>> Bug
> >>>> Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> >>>> file containing the infinity symbol (' ∞ ') results in the infinity
> >>>> symbol imported as the number 8. Other Unicode characters seem
> >>>> unaffected, for example Zhe: ж
> >>>>
> >>>> Expected Behavior:
> >>>> The imported data.frame should represent the infinity symbol as the
> >>>> expected 'Inf' so that normal mathematical operations can be processed
> >>>>
> >>>> Stack Overflow Post:
> >>>> I created a question on Stack Overflow where one other member was able
> >>>> to reproduce the same issues I was having. This question can be found
> >>>> at:
> >>>>
> >>>> https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> >>>>
> >>>> Method to Reproduce - 1:
> >>>> A simple method to reproduce this issue is to use RStudio: In the
> >>>> console, type the following:
> >>>>> read.table(text=" ∞", encoding="UTF-8")
> >>>> The result should be a data.frame with a single value of '8'.
> >>>>
> >>>> Repeating the same with ж results in the correct, expected behavior.
> >>>>
> >>>> Method to Reproduce - 2:
> >>>> Create a .csv file containing the infinity and Zhe characters (I have
> >>>> attached the file for convenience, hopefully it is not rejected by your
> >>>> email service). Launch an interactive session using
> >>>>
> >>>>> r --vanilla
> >>>> Enter the following statement taking care to replace the
> >>>> <path-to-file> with the appropriate one:
> >>>>
> >>>>> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> >>>>
> >>>> This should result in a two element data.frame; the first being the
> >>>> incorrect value of 8 with an additional <U+FEFF> and the second the
> >>>> correct value of Zhe.
> >>>>
> >>>> Note the additional <U+FEFF> prefixed to the front of the '8'. This
> >>>> appears to be a hidden character for the purposes of letting editors
> >>>> know the encoding. The following link has some explanation; however, it
> >>>> states this is caused by Excel. I created the file using Notepad,
> >>>> not Excel.
> >>>>
> >>>>
> >>>> https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> >>>>
> >>>> System Details:
> >>>> OS:
> >>>>> Windows 10.0.17134 Build 17134
> >>>>
> >>>> R Version:
> >>>>> platform       x86_64-w64-mingw32
> >>>>> arch           x86_64
> >>>>> os             mingw32
> >>>>> system         x86_64, mingw32
> >>>>> status
> >>>>> major          3
> >>>>> minor          4.1
> >>>>> year           2017
> >>>>> month          06
> >>>>> day            30
> >>>>> svn rev        72865
> >>>>> language       R
> >>>>> version.string R version 3.4.1 (2017-06-30)
> >>>>> nickname       Single Candle
> >>>> ______________________________________________
> >>>> R-devel using r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>> --
> >>> Peter Dalgaard, Professor,
> >>> Center for Statistics, Copenhagen Business School
> >>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> >>> Phone: (+45)38153501
> >>> Office: A 4.23
> >>> Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com