[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Peter Dalgaard pdalgd using gmail.com
Thu Feb 7 13:55:53 CET 2019


This doesn't seem to happen on macOS, in either Terminal or RStudio (R 3.5.1, R-devel, R-patched), so it is probably Windows-specific.

-pd

> On 7 Feb 2019, at 11:17 , David Byrne <david.byrne222 using gmail.com> wrote:
> 
> Bug
> Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> file containing the infinity symbol (' ∞ ') results in the infinity
> symbol being imported as the number 8. Other Unicode characters seem
> unaffected, for example, Zhe: ж
> 
> Expected Behavior:
> The imported data.frame should represent the infinity symbol as the
> expected 'Inf' so that normal mathematical operations can be performed.
> 
> Stack Overflow Post:
> I created a question on Stack Overflow where one other member was able
> to reproduce the same issues I was having. This question can be found
> at:
> https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> 
> Method to Reproduce - 1:
> A simple method to reproduce this issue is to use RStudio. In the
> console, type the following:
>> read.table(text=" ∞", encoding="UTF-8")
> 
> The result should be a data.frame with a single value of '8'.
> 
> Repeating the same with ж results in the correct, expected behavior.
> 
> Method to Reproduce - 2:
> Create a .csv file containing the infinity and Zhe characters (I have
> attached the file for convenience; hopefully it is not rejected by your
> email service). Launch an interactive session using
> 
>> r --vanilla
> 
> Enter the following statement taking care to replace the
> <path-to-file> with the appropriate one:
> 
>> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> 
> 
> This should result in a two-element data.frame: the first element is
> the incorrect value of 8 with an additional <U+FEFF> prefix, and the
> second is the correct value of Zhe.
> 
> Note the additional <U+FEFF> prefixed to the front of the '8'. This
> appears to be a hidden character that lets editors know the file's
> encoding. The following link has some explanation; however, it
> states that this is caused by Excel. The file I created was made using
> Notepad, not Excel.
> 
> https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> 
> System Details:
> OS:
>> Windows 10.0.17134 Build 17134
> 
> 
> R Version:
>> platform       x86_64-w64-mingw32
>> arch           x86_64
>> os             mingw32
>> system         x86_64, mingw32
>> status
>> major          3
>> minor          4.1
>> year           2017
>> month          06
>> day            30
>> svn rev        72865
>> language       R
>> version.string R version 3.4.1 (2017-06-30)
>> nickname       Single Candle
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
