[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Paul McQuesten mcquesten using gmail.com
Thu Feb 7 15:38:55 CET 2019


Windows Notepad prefixes UTF-8 files with a Byte Order Mark (U+FEFF).
Per https://en.wikipedia.org/wiki/Byte_order_mark, this is permitted in
UTF-8, but not required.
I suppose that other Windows programs do likewise (in addition to Excel
and Notepad).

"The Unicode Standard permits the BOM in UTF-8 but does not require or
recommend its use. Byte order has no meaning in UTF-8, so its only use in
UTF-8 is to signal at the start that the text stream is encoded in UTF-8,
or that it was converted to UTF-8 from a stream that contained an optional
BOM. The standard also does not recommend removing a BOM when it is there,
so that round-tripping between encodings does not lose information, and so
that code that relies on it continues to work. The IETF recommends that if
a protocol either (a) always uses UTF-8, or (b) has some other way to
indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF
as a signature.""
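
As a minimal sketch of how such a BOM-prefixed file can be handled in R
(the file below is a made-up temporary file, not the attachment from this
thread): R's connections accept the encoding name "UTF-8-BOM" for reading,
which drops a leading BOM if one is present. This only concerns the stray
<U+FEFF>; it does not address the "∞" to "8" conversion quoted below.

```r
# Write a small file that starts with the UTF-8 BOM (bytes EF BB BF),
# roughly what Notepad produces. The content is ASCII-only on purpose.
tmp  <- tempfile(fileext = ".csv")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("a,b\n1,2\n")
writeBin(c(bom, body), tmp)

# Inspect the first three bytes: they are the UTF-8 encoding of U+FEFF.
first3 <- readBin(tmp, what = "raw", n = 3L)
print(first3)

# fileEncoding = "UTF-8-BOM" asks R to skip the mark while reading,
# so the first column name is not contaminated by it.
df <- read.table(tmp, sep = ",", header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(df))

unlink(tmp)
```

Note that fileEncoding (re-encode the input while reading) is a different
argument from encoding (only mark the strings as being in that encoding),
which is the one used in the original report.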

On Thu, Feb 7, 2019 at 8:10 AM Daniel Possenriede <possenriede using gmail.com>
wrote:

> There seems to be something odd with "∞" on Windows (and not only with
> read.table).
> In the native encoding (cp-1252 in my case), "∞" gets converted to "8":
>
> x <-  "∞"
> Encoding(x)
> #> [1] "unknown"
> print(x)
> #> [1] "8"
> charToRaw(x)
> #> [1] 38
>
> "∞" is indeed "8"
>
> identical(x, "8")
> #> [1] TRUE
>
> Everything seems fine if  "∞" is UTF-8 encoded.
>
> y <- "\u221E"
> Encoding(y)
> #> [1] "UTF-8"
> print(y)
> #> [1]  "∞"
> charToRaw(y)
> #> [1] e2 88 9e
>
> Unless the string is converted back to native encoding.
>
> format(y)
> #> [1] "8"
>
> This ought to be "<U+221E>", analogous to
>
> format("∝")
> #> [1] "<U+221D>"
>
> Session Info:
>
> si <- sessionInfo()
> si$running
> #> [1] "Windows 10 x64 (build 17134)"
> si$R.version$version.string
> #> [1] "R version 3.5.2 (2018-12-20)"
> si$locale
> #> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>
>
>
> On Thu, 7 Feb 2019 at 14:33, David Byrne <david.byrne222 using gmail.com> wrote:
>
> > I can confirm that it doesn't happen on Ubuntu 18.04.1, so Peter is
> > most likely correct; it looks like it's Windows-specific.
> >
> > On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pdalgd using gmail.com> wrote:
> > >
> > > This doesn't seem to be happening on macOS, neither in Terminal nor
> > > in RStudio (R 3.5.1, R-devel, R-patched). So it is probably
> > > Windows-specific.
> > >
> > > -pd
> > >
> > > > On 7 Feb 2019, at 11:17, David Byrne <david.byrne222 using gmail.com> wrote:
> > > >
> > > > Bug
> > > > Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> > > > file containing the infinity symbol (' ∞ ') results in the infinity
> > > > symbol imported as the number 8. Other Unicode characters seem
> > > > unaffected, for example Zhe: ж
> > > >
> > > > Expected Behavior:
> > > > The imported data.frame should represent the infinity symbol as the
> > > > expected 'Inf' so that normal mathematical operations can be processed
> > > >
> > > > Stack Overflow Post:
> > > > I created a question on Stack Overflow where one other member was able
> > > > to reproduce the same issues I was having. This question can be found
> > > > at:
> > > >
> > > > https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> > > >
> > > > Method to Reproduce - 1:
> > > > A simple method to reproduce this issue is to use RStudio. In the
> > > > console, type the following:
> > > >> read.table(text=" ∞", encoding="UTF-8")
> > > >
> > > > The result should be a data.frame with a single value of '8'.
> > > >
> > > > Repeating the same with ж results in the correct, expected behavior.
> > > >
> > > > Method to Reproduce - 2:
> > > > Create a .csv file containing the infinity and Zhe characters (I have
> > > > attached the file for convenience; hopefully it is not rejected by your
> > > > email service). Launch an interactive session using
> > > >
> > > >> r --vanilla
> > > >
> > > > Enter the following statement, taking care to replace
> > > > <path-to-file> with the appropriate one:
> > > >
> > > >> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> > > >
> > > >
> > > > This should result in a two element data.frame; the first being the
> > > > incorrect value of 8 with an additional <U+FEFF> and the second the
> > > > correct value of Zhe.
> > > >
> > > > Note the additional <U+FEFF> prefixed to the '8'. This appears to be
> > > > a hidden character that lets editors know the encoding. The following
> > > > link has some explanation; however, it states this is caused by Excel.
> > > > I created the file using Notepad, not Excel.
> > > >
> > > > https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> > > >
> > > > System Details:
> > > > OS:
> > > >> Windows 10.0.17134 Build 17134
> > > >
> > > >
> > > > R Version:
> > > >> platform       x86_64-w64-mingw32
> > > >> arch           x86_64
> > > >> os             mingw32
> > > >> system         x86_64, mingw32
> > > >> status
> > > >> major          3
> > > >> minor          4.1
> > > >> year           2017
> > > >> month          06
> > > >> day            30
> > > >> svn rev        72865
> > > >> language       R
> > > >> version.string R version 3.4.1 (2017-06-30)
> > > >> nickname       Single Candle
> > > > ______________________________________________
> > > > R-devel using r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-devel
> > >
> > > --
> > > Peter Dalgaard, Professor,
> > > Center for Statistics, Copenhagen Business School
> > > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > > Phone: (+45)38153501
> > > Office: A 4.23
> > > Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com