[Rd] source(echo = TRUE) with an iso-8859-1 encoded file gives an error

Ista Zahn istazahn at gmail.com
Tue May 1 21:15:30 CEST 2018


Hi Scott,

This question is appropriate for the r-help mailing list, but probably
off-topic here on r-devel.

Best,
Ista

On Tue, May 1, 2018 at 2:57 PM, Scott Kostyshak <skostyshak at ufl.edu> wrote:
> I have very little knowledge about file encodings and would like to
> learn more.
>
> I've read the following pages to learn more:
>
>   http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
>   https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
>   https://developer.r-project.org/Encodings_and_R.html
>
> The last one, in particular, has been very helpful. I would be
> interested in any further references that you suggest.
>
> I attach a file that reproduces the issue I would like to learn more
> about. I do not know if the file encoding will be correctly preserved
> through email, so I also provide the file (temporarily) on Dropbox here:
>
>   https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0
>
> The file gives an error when using "source()" with the
> argument echo = TRUE:
>
>   > source("encoding_export_issue.R", echo = TRUE)
>   Error in nchar(dep, "c") : invalid multibyte string, element 1
>   In addition: Warning message:
>   In grepl("^[[:blank:]]*$", dep[1L]) :
>     input string 1 is invalid in this locale
>
> The problem comes from the "á" character in the .R file. The file
> appears to be encoded as "iso-8859-1":
>
>   $ file --mime-encoding encoding_export_issue.R
>   encoding_export_issue.R: iso-8859-1
>
> Note that for me:
>
>   > getOption("encoding")
>   [1] "native.enc"
>
> so "native.enc" is used for the "encoding" argument of source().
>
> The following two calls succeed:
>
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")
>
> Is this file a valid "iso-8859-1" encoded file?  Why does source() fail
> when encoding is set to "native.enc"? Is it because my locale is set to
> UTF-8 (see the info on my system at the bottom of this email)?
>
> I'm guessing it would be a bad idea to put
>
>   options(encoding = "unknown")
>
> in my .Rprofile, because it is difficult to always correctly guess the
> encoding of files? Is there a reason why setting it to "unknown" would
> lead to more problems than leaving it set to "native.enc"?
>
> I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
> is my session info and locale info for my system with the 3.4.3 version:
>
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.3 LTS
>
> Matrix products: default
> BLAS: /usr/lib/libblas/libblas.so.3.6.0
> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3
>
>> Sys.getlocale()
> [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
>
> Thanks for your time,
>
> Scott
>
>
> --
> Scott Kostyshak
> Assistant Professor of Economics
> University of Florida
> https://people.clas.ufl.edu/skostyshak/
>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
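
For readers of the archive, here is a minimal sketch of what is probably
going on, under stated assumptions. The temporary file below is a
hypothetical stand-in for encoding_export_issue.R (the real file is not
reproduced here), and the variable names are made up for illustration.
With encoding = "native.enc" the bytes are not re-encoded, so in a UTF-8
locale the single byte 0xE1 (the ISO-8859-1 "á") is treated as UTF-8,
where it is not a valid sequence. That would be consistent with the
"invalid multibyte string" error quoted above.

```r
# Hypothetical stand-in for encoding_export_issue.R: one line containing
# the single byte 0xE1, which is "á" in ISO-8859-1.
path <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "'), as.raw(0xE1), charToRaw('"\n')), path)

# Read the line as-is; no conversion happens here.
line <- readLines(path, warn = FALSE)

# 0xE1 opens a multi-byte sequence in UTF-8, but the next byte is a plain
# double quote, so the line is not a valid UTF-8 string.
is_utf8 <- validUTF8(line)

# Converting explicitly from latin1 yields the two-byte UTF-8 form of "á".
line_utf8 <- iconv(line, from = "latin1", to = "UTF-8")
is_utf8_after <- validUTF8(line_utf8)

# Write a UTF-8 copy of the script; useBytes = TRUE writes the converted
# bytes unchanged instead of translating them to the native encoding.
utf8_path <- tempfile(fileext = ".R")
writeLines(line_utf8, utf8_path, useBytes = TRUE)
```

Converting the file once, as in the last step, is an alternative to
passing encoding = "iso-8859-1" on every source() call. Which of the two
is preferable depends on who else edits the file and with what tools.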



