[R] R on Windows crashes when source'ing UTF-8 file

Kenn Konstabel lebatsnok at gmail.com
Fri Jul 11 00:00:00 CEST 2014


I confirm that the original problem doesn't happen in R 3.1.1 on
Windows (XP, this time). That is,

source("http://psych.ut.ee/~nek/R/test-utf8.txt")

... no longer crashes R but gives a sensible (i.e., understandable,
after this discussion) error.
... and adding encoding="UTF-8-BOM" reads in the file correctly.
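
For anyone who wants to reproduce the working case without the remote
file, here is a minimal sketch. The temporary file and its one-line
contents are made up for illustration and are not the original
test-utf8.txt; the only thing carried over from the thread is the
encoding="UTF-8-BOM" argument.

```r
# Hypothetical stand-in for the remote script: a file that starts with
# the UTF-8 BOM (bytes EF BB BF) followed by one line of R code.
path <- tempfile(fileext = ".R")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("x <- 1 + 1\n")), path)

# With encoding = "UTF-8-BOM" the leading BOM is dropped before parsing.
env <- new.env()
source(path, local = env, encoding = "UTF-8-BOM")
env$x
```

Sourcing into a separate environment is only there to keep the example
from touching the workspace.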



On Thu, Jul 10, 2014 at 5:50 PM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> On 10/07/2014 9:53 AM, Kenn Konstabel wrote:
>>
>> Wow. Thanks a lot!
>>
>> source("http://psych.ut.ee/~nek/R/test-utf8.txt", encoding="UTF-8-BOM")
>> # works correctly on my Windows 7 machine
>> # (and without encoding argument it still crashes R)
>>
>> Kenn
>>
>> On Thu, Jul 10, 2014 at 4:33 PM, John McKown
>> <john.archie.mckown at gmail.com> wrote:
>> > On Thu, Jul 10, 2014 at 7:18 AM, Kenn Konstabel <lebatsnok at gmail.com>
>> > wrote:
>> >> Dear all,
>> >>
>> >> I found an unexpected behaviour when trying to `source` a UTF-8 file
>> >> on Windows 7:
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test-utf8.txt")
>> >>
>> >> # Rgui.exe reacts:
>> >> # R for windows GUI has stopped working. A problem caused the program
>> >> # to stop working correctly.
>> >> # Windows will close the program and notify you if a solution is
>> >> # available.
>> >>
>> >> The same will happen with R.exe ("terminal") and R running within
>> >> RStudio. (Session and locale info below).
>> >>
>> >> However, a non-utf version of this little script can be `source`d
>> >> without problems.
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test.txt")
>> >>
>> >> Adding the `encoding` argument to `source` helps a little:
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test-utf8.txt", encoding="utf-8")
>> >> #  unsure about the spelling of utf-8 so I also tried UTF8, utf8, and
>> >> #  UTF-8
>> >> # ... with the same result in all cases
>> >>
>> >> R doesn't crash any more but gives the following error:
>> >>
>> >> # Error in source("http://psych.ut.ee/~nek/R/test-utf8.txt", encoding = "utf-8") :
>> >> #   http://psych.ut.ee/~nek/R/test-utf8.txt:2:0: unexpected end of input
>> >> # 1: ?
>> >> #    ^
>> >> # In addition: Warning message:
>> >> # In readLines(file, warn = FALSE) :
>> >> #  invalid input found on input connection 'http://psych.ut.ee/~nek/R/test-utf8.txt'
>> >
>> > I just tried that. On Windows XP/Pro,  R 3.1.0 didn't fail, but did
>> > get the error you mention later. I used "wget" to actually download
>> > the file mentioned (on Linux). I think that the problem _may_ be that
>> > the file starts with a BOM (Byte Order Mark), which is 0xef, 0xbb,
>> > 0xbf. This is supposed to tell us that this is UTF-8.
>> >
>> > BOM: http://en.wikipedia.org/wiki/Byte_order_mark
>> >
>> > I get an identical error with R 3.1.0 on both Windows XP/Pro and Linux
>> > Fedora 20. The problem is that the R readLines() apparently does not
>> > like the leading BOM. It reads it as data. Most other Linux and
>> > Windows applications _do_ understand the BOM and so, when you use
>> > them, they work properly. And, normally, when you then save the file,
>> > the software does not write the BOM at the start. So it works on the
>> > saved version of the file.
>> >
>> > Being the curious sort, I decided to look at the source to R. In
>> > particular in ~/R/src/main/connections.c I saw where it did support
>> > the reading of BOMs. But there is a special way to do it! Which I
>> > cannot find in the documentation.
>> >
>> > source("http://psych.ut.ee/~nek/R/test-utf8.txt",encoding="UTF-8-BOM");
>> >
>> > I tried the above AND IT WORKED properly!
>> >
>> > I simply adore having source code.
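
A quick way to check whether a given file really starts with the UTF-8
BOM is to look at its first three bytes. This is only a sketch:
has_utf8_bom() is a hypothetical helper, not part of R, and the two
temporary files stand in for the real scripts.

```r
# Hypothetical helper: TRUE if the file begins with the bytes EF BB BF.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Two made-up files with the same code, one with a BOM and one without.
with_bom <- tempfile(fileext = ".R")
without_bom <- tempfile(fileext = ".R")
code <- charToRaw("x <- 1\n")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), code), with_bom)
writeBin(code, without_bom)

has_utf8_bom(with_bom)     # TRUE
has_utf8_bom(without_bom)  # FALSE
```

For a remote script, the same check could be applied after saving a
local copy of the file.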
>
>
> Searching the source for the string "UTF-8-BOM" finds it mentioned in the
> docs in 3 places:  in the NEWS file,
> in the R Data Import/Export manual, and in the ?connections help page.
>
> Duncan Murdoch
>
>> >
>> >
>> >>
>> >> I thought maybe that's because what notepad told me is UTF-8 is
>> >> actually something else ... so I did two more experiments.
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test2.R")
>> >> # this was created on a linux machine with leafpad, and saved as utf-8 text
>> >> # it can be source'd on windows
>> >>
>> >> source("http://psych.ut.ee/~nek/R/test3.R")
>> >> # the same as previous but o's in file were replaced by ö's
>> >> # can be source'd on windows but the "ö" character is shown as ƶ
>> >> # except if you add encoding="utf-8" - then it works as expected
>> >>
>> >> So in sum, I can create "plain text" (saved with utf-8 encoding) files
>> >> on Windows that cannot be sourced into R on Windows, or will crash R
>> >> (depending on how you source them). The same files can be sourced on
>> >> Linux without problems. Part of the problem is obviously in Windows,
>> >> but R should at least not crash.
>> >>
>> >> Session info:
>> >>
>> >>  R version 3.0.2 (2013-09-25)
>> >> Platform: i386-w64-mingw32/i386 (32-bit)
>> >>
>> >> locale:
>> >> [1] LC_COLLATE=Estonian_Estonia.1257  LC_CTYPE=Estonian_Estonia.1257
>> >> [3] LC_MONETARY=Estonian_Estonia.1257 LC_NUMERIC=C
>> >> [5] LC_TIME=Estonian_Estonia.1257
>> >>
>> >> attached base packages:
>> >> [1] stats     graphics  grDevices utils     datasets  methods   base
>> >>
>> >> loaded via a namespace (and not attached):
>> >> [1] tools_3.0.2
>> >>
>> >>
>> >> OS: Windows 7
>> >>
>> >> Linux Mint Debian Edition and R 3.0.2 on the other machine (where
>> >> everything worked).
>> >>
>> >> Context:
>> >>
>> >> I was trying to find out how to make files that could be source'd on
>> >> both windows and linux. This is partly solved so I have no specific
>> >> question other than "is this a bug in windows version?" but any
>> >> comments on the general topic would be appreciated too.
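
One possible approach, sketched under the assumption that rewriting the
file is acceptable: remove the BOM once, so the same file can be
source'd without a special encoding argument. strip_utf8_bom() is a
hypothetical helper, not an R function, and the file below is a made-up
stand-in. Non-ASCII content would still need encoding="UTF-8" on
Windows, as with test3.R above.

```r
# Hypothetical helper: rewrite a file without its leading UTF-8 BOM.
strip_utf8_bom <- function(path, out = path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]  # drop only the three BOM bytes
  }
  writeBin(bytes, out)
  invisible(out)
}

# Made-up script saved with a BOM, as some Windows editors do.
script <- tempfile(fileext = ".R")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("y <- 1 + 2\n")), script)

strip_utf8_bom(script)

# After stripping, a plain source() call parses the file.
env2 <- new.env()
source(script, local = env2)
env2$y
```

The helper works on raw bytes, so it leaves the rest of the file
untouched whatever its contents.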
>> >>
>> >> Best regards,
>> >>
>> >> Kenn
>> >>
>> >>
>> >> Kenn Konstabel
>> >> Research fellow
>> >> Department of chronic diseases
>> >> National Institute of Health Development
>> >> Hiiu 42
>> >> Tallinn
>> >> Estonia
>> >
>> > --
>> > There is nothing more pleasant than traveling and meeting new people!
>> > Genghis Khan
>> >
>> > Maranatha! <><
>> > John McKown
>>
>
>


