[R] Non-ASCII characters in R on Windows

Mon Sep 16 17:56:41 CEST 2013

UTF-8 on Windows is a huge pain, and it bites me often. Usually I give
up and do the analysis on a Linux server. In previous struggles with
this I've found this blog post enlightening:
https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
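
For what it's worth, here is a minimal sketch of the workaround that
Milan's scan() results quoted below point to: pass encoding="UTF-8"
(which only marks the strings as UTF-8) instead of fileEncoding="UTF-8"
(which re-encodes the input), and set check.names=FALSE so that
make.names() leaves the header alone. The temporary file and the name
`dat` are only for illustration, and I have not checked this on every
Windows/R combination, so treat it as something to test, not a
guaranteed fix.

```r
# Write the sample data from the thread as raw UTF-8 bytes.
# useBytes = TRUE avoids any conversion through the native code page.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(c("\"\u0531\",\"\u0532\"", "1,10", "2,20")),
           con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings as UTF-8 without re-encoding the input.
# check.names = FALSE keeps make.names() from rewriting the column names.
dat <- read.csv(tmp, encoding = "UTF-8", check.names = FALSE,
                stringsAsFactors = FALSE)

print(names(dat))
print(dat)
```

If the header still comes back mangled, looking at Encoding(names(dat))
may help narrow down where the conversion happens.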

Best,
Ista

On Mon, Sep 16, 2013 at 10:38 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> On Monday 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
>> On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
>> > This is a condensed version of the same question on stackexchange here:
>> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
>> > If you've already stumbled upon it feel free to ignore.
>> >
>> > My problem is that R on US Windows does not read *any* text file that
>> > contains *any* foreign characters. It simply reads the first consecutive n
>> > ASCII characters and then throws a warning once it reaches a foreign
>> > character:
>> >
>> > > test <- read.table("test.txt", sep=";", dec=",", quote="", fileEncoding="UTF-8")
>> > Warning messages:
>> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
>> >   invalid input found on input connection 'test.txt'
>> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
>> >   incomplete final line found by readTableHeader on 'test.txt'
>> > > print(test)
>> >        V1
>> > 1 english
>> >
>> > > Sys.getlocale()
>> > [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>> >
>> >
>> > It is important to note that R on Linux will read UTF-8 as well as
>> > exotic character sets without a problem. I've tried it with the exact same
>> > files (one was UTF-8 and another was OEM866 Cyrillic).
>> >
>> > If I do not include the fileEncoding parameter, read.table will read the
>> > whole CSV file. But naturally it will read it wrong because it does not
>> > know the encoding. So whenever I try to specify the fileEncoding, R will
>> > throw the warnings and stop once it reaches a foreign character. It's the
>> > same story with all international character encodings.
>> > Other users on stackexchange have reported exactly the same issue.
>> >
>> >
>> > Is anyone here who is on a US version of Windows able to import files with
>> > foreign characters? Please let me know.
>> A reproducible example would have helped, as requested by the posting
>> guide.
>>
>> That said, I am experiencing the same problem after saving the data
>> below to a CSV file encoded in UTF-8 (you can do this even with
>> Notepad):
>> "Ա","Բ"
>> 1,10
>> 2,20
>>
>> This is on a Windows 7 box using French locale, but same codepage 1252
>> as yours. What is interesting is that reading the file using
>> readLines(file("myFile.csv", encoding="UTF-8"))
>> gives no invalid characters. So there must be a bug in read.table().
>>
>>
>> But I must note I do not experience issues with French accented
>> characters like "é" ("\Ue9"). By contrast, reading Armenian
>> characters like "Ա" ("\U531") gives weird results: the character appears
>> as <U+0531> instead of Ա.
>>
>> Self-contained example, writing the file and reading it back from R:
>> tmpfile <- tempfile()
>> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
>> readLines(file(tmpfile, encoding="UTF-8"))
>> # "<U+0531>"
>>
>> The same phenomenon happens when creating a data frame from this
>> character (as noted on StackExchange):
>> data.frame("\U531")
>>
>> So my conclusion is that maybe Windows does not really support Unicode
>> characters that are not "relevant" for your current locale. And that may
>> have created bugs in the way R handles them in read.table(). R
>> developers can probably tell us more about it.
> After some more investigation, one part of the problem can be traced
> back to scan() (with myFile.csv filled as described above):
> scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
> # Read 2 items
> # [1] "Ա" "Բ"
>
> Equivalent, but nonsensical to me:
> scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
> # Read 2 items
> # [1] "Ա" "Բ"
>
> scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
> # Read 0 items
> # character(0)
> # Warning message:
> # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> #  invalid input found on input connection 'myFile.csv'
>
>
> So there seems to be one part of the issue in scan(), which for some
> reason does not work when passed fileEncoding="UTF-8"; and another part
> in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
> probably via make.names(), since:
> make.names("\U531")
> # "X.U.0531."
>
>
> Does this make sense to R-core members?
>
>
> Regards
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.