[R] [Rd] source(echo = TRUE) with an iso-8859-1 encoded file gives an error

Scott Kostyshak skostyshak at ufl.edu
Fri May 4 22:47:00 CEST 2018


I have very little knowledge about file encodings and would like to
learn more.

So far I've read the following pages:

  http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
  https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
  https://developer.r-project.org/Encodings_and_R.html

The last one, in particular, has been very helpful. I would be
interested in any further references that you suggest.

I attach a file that reproduces the issue I would like to learn more
about. I do not know if the file encoding will be correctly preserved
through email, so I also provide the file (temporarily) on Dropbox here:

  https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0

Calling source() on the file with the argument echo = TRUE gives an
error:

  > source("encoding_export_issue.R", echo = TRUE)
  Error in nchar(dep, "c") : invalid multibyte string, element 1
  In addition: Warning message:
  In grepl("^[[:blank:]]*$", dep[1L]) :
    input string 1 is invalid in this locale

The problem comes from the "á" character in the .R file. The file
appears to be encoded as "iso-8859-1":

  $ file --mime-encoding encoding_export_issue.R 
  encoding_export_issue.R: iso-8859-1
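
A minimal sketch of why this byte matters, assuming the "á" is stored as
the single byte 0xE1 (its ISO-8859-1 code). The variable names are
hypothetical and the snippet does not read the attached file; it only
shows that the byte is not valid UTF-8 on its own and becomes valid once
converted with iconv():

```r
# Assumption: "á" is stored as the single ISO-8859-1 byte 0xE1.
latin1_a <- rawToChar(as.raw(0xe1))

# On its own, that byte is not a valid UTF-8 sequence.
is_valid_before <- validUTF8(latin1_a)

# Converting from ISO-8859-1 to UTF-8 gives the two bytes 0xC3 0xA1.
utf8_a <- iconv(latin1_a, from = "ISO-8859-1", to = "UTF-8")
is_valid_after <- validUTF8(utf8_a)
utf8_bytes <- charToRaw(utf8_a)
```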

Note that for me:

  > getOption("encoding")
  [1] "native.enc"

so source() uses "native.enc" as the default for its "encoding" argument.

The following two calls succeed:

  > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
  > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")

Is this file a valid "iso-8859-1" encoded file? Why does source() fail
when the encoding is set to "native.enc"? Is it because my locale is set
to UTF-8 (see the information on my system at the bottom of this email)?

I'm guessing it would be a bad idea to put

  options(encoding = "unknown")

in my .Rprofile, because it is difficult to always correctly guess the
encoding of files? Is there a reason why setting it to "unknown" would
lead to more problems than leaving it set to "native.enc"?
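
One alternative to a global option would be to check each file before
sourcing it. The following is only a sketch under a strong assumption:
guess_source_encoding() is a hypothetical helper (not part of base R),
and it treats any file that is not valid UTF-8 as ISO-8859-1, which is a
guess rather than real detection. The two temporary files are stand-ins
for the attached file, not copies of it:

```r
# Hypothetical helper (not part of base R): pick an 'encoding' value for source().
# Assumption: a file that is not valid UTF-8 is ISO-8859-1. This is only a guess.
guess_source_encoding <- function(path, fallback = "iso-8859-1") {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  text <- rawToChar(bytes)
  if (validUTF8(text)) "UTF-8" else fallback
}

# Stand-in file: the statement x <- "á" with "á" as the Latin-1 byte 0xE1.
latin1_file <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "'), as.raw(0xe1), charToRaw('"\n')), latin1_file)

# The same statement with "á" as the UTF-8 bytes 0xC3 0xA1.
utf8_file <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "'), as.raw(c(0xc3, 0xa1)), charToRaw('"\n')), utf8_file)

guess_latin1 <- guess_source_encoding(latin1_file)
guess_utf8 <- guess_source_encoding(utf8_file)
```

The result could then be passed on explicitly, for example
source(path, echo = TRUE, encoding = guess_source_encoding(path)),
so that getOption("encoding") stays at its default.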

I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
is my session info and locale info for my system with the 3.4.3 version:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.3

> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

Thanks for your time,

Scott

P.S. I originally posted this question to r-devel, which was the wrong
list. For archival purposes, I reference that thread here:

https://www.mail-archive.com/search?l=mid&q=20180501185750.445oub53vcdnyyyx%40steph


-- 
Scott Kostyshak
Assistant Professor of Economics
University of Florida
https://people.clas.ufl.edu/skostyshak/


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: encoding_export_issue.R
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20180504/45bab4a0/attachment.ksh>

