[R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error

Scott Kostyshak skostyshak at ufl.edu
Sat May 5 22:52:34 CEST 2018


On Fri, May 04, 2018 at 10:58:26PM +0000, Ista Zahn wrote:
> On Fri, May 4, 2018 at 4:47 PM, Scott Kostyshak <skostyshak at ufl.edu> wrote:
> > I have very little knowledge about file encodings and would like to
> > learn more.
> >
> > I've read the following pages to learn more:
> >
> >   http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
> >   https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
> >   https://developer.r-project.org/Encodings_and_R.html
> >
> > The last one, in particular, has been very helpful. I would be
> > interested in any further references that you suggest.
> >
> > I attach a file that reproduces the issue I would like to learn more
> > about. I do not know if the file encoding will be correctly preserved
> > through email, so I also provide the file (temporarily) on Dropbox here:
> >
> >   https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0
> >
> > The file gives an error when using "source()" with the
> > argument echo = TRUE:
> >
> >   > source("encoding_export_issue.R", echo = TRUE)
> >   Error in nchar(dep, "c") : invalid multibyte string, element 1
> >   In addition: Warning message:
> >   In grepl("^[[:blank:]]*$", dep[1L]) :
> >     input string 1 is invalid in this locale
> >
> > The problem comes from the "á" character in the .R file. The file
> > appears to be encoded as "iso-8859-1":
> >
> >   $ file --mime-encoding encoding_export_issue.R
> >   encoding_export_issue.R: iso-8859-1
> >
> > Note that for me:
> >
> >   > getOption("encoding")
> >   [1] "native.enc"
> >
> > so "native.enc" is used for the "encoding" argument of source().
> >
> > The following two calls succeed:
> >
> >   > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
> >   > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")
> >
> > Is this file a valid "iso-8859-1" encoded file?
> 
> The one you attached is not. The one linked to in dropbox is.
> 
> > Why does source() fail
> > in the case of encoding set to "native.enc"? Is it because of the
> > settings to UTF-8 in my locale (see info on my system at the bottom of
> > this email).
> 
> Yes.
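
One way to see why a UTF-8 locale triggers this error, as a minimal sketch in base R: in ISO-8859-1, "á" is the single byte 0xE1, and that byte followed by an ASCII character is not a valid UTF-8 sequence. The variable names are illustrative, and marking the string as UTF-8 with `Encoding<-` is only an assumed stand-in for a UTF-8 session reading the file with "native.enc"; it is not a claim about what source() does internally.

```r
# "á" in ISO-8859-1 is the single byte 0xE1; build the two-byte string "a" + 0xE1.
latin1_bytes <- rawToChar(as.raw(c(0x61, 0xE1)))

# Treat the same bytes as UTF-8 (assumption: this mimics a UTF-8 session
# that reads the file without re-encoding it).
as_utf8 <- latin1_bytes
Encoding(as_utf8) <- "UTF-8"

# 0xE1 opens a three-byte UTF-8 sequence that never completes here.
bytes_are_valid_utf8 <- validUTF8(latin1_bytes)

# Counting characters fails unless NA is explicitly allowed.
count_or_error <- tryCatch(nchar(as_utf8, "chars"),
                           error = function(e) conditionMessage(e))
count_or_na <- nchar(as_utf8, "chars", allowNA = TRUE)

# Declaring the real source encoding yields a well-formed two-character string.
converted <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
converted_count <- nchar(converted, "chars")
```

Here `count_or_error` holds an error message rather than a count, which is the same kind of failure as the nchar() error quoted above.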
> 
> >
> > I'm guessing it would be a bad idea to put
> >
> >   options(encoding = "unknown")
> >
> > in my .Rprofile, because it is difficult to always correctly guess the
> > encoding of files?
> 
> My guess is that the issue is less about the difficulty of guessing
> the encoding, and more about the time it takes to do so. That's not
> particularly relevant for the "source" function, but the encoding
> option is used by many of the file IO functions in R and so has
> implications well beyond the behavior of "source".

Ah I did not think about this possibility. Makes sense.

> 
> > Is there a reason why setting it to "unknown" would
> > lead to more problems than leaving it set to "native.enc"?
> 
> It depends on what you are actually doing. If you are on a UTF-8
> locale and working exclusively with UTF-8 files, setting
> options(encoding = "unknown") will just slow down your file IO by
> checking for the encoding every time.

Good to know. Thank you for your response, Ista.
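
For the situation described above (a UTF-8 locale, with the aim of working only with UTF-8 files), one option is to convert a stray ISO-8859-1 script once instead of changing options(encoding). The following is a minimal sketch in base R under stated assumptions: the helper name `to_utf8_file` is hypothetical, the temporary file stands in for encoding_export_issue.R, and the source encoding is assumed to be known in advance rather than guessed.

```r
# Hypothetical helper: re-encode a small text file from a known encoding to UTF-8.
to_utf8_file <- function(path_in, path_out, from = "latin1") {
  bytes <- readBin(path_in, what = "raw", n = file.size(path_in))
  text <- rawToChar(bytes)                 # assumes the file has no NUL bytes
  if (validUTF8(text)) {
    converted <- text                      # already UTF-8 (or plain ASCII)
  } else {
    converted <- iconv(text, from = from, to = "UTF-8")
    if (is.na(converted)) stop("could not convert from ", from)
  }
  writeBin(charToRaw(converted), path_out)
  invisible(path_out)
}

# Stand-in for the script in this thread: one line with "á" stored as byte 0xE1.
src <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "'), as.raw(0xE1), charToRaw('"\n')), src)

dst <- tempfile(fileext = ".R")
to_utf8_file(src, dst)

# The converted file stores "á" as the two UTF-8 bytes 0xC3 0xA1.
out_bytes <- readBin(dst, what = "raw", n = file.size(dst))
```

After a conversion like this, source(dst, echo = TRUE) should no longer need an explicit encoding argument in a UTF-8 locale, and the global "encoding" option can stay at "native.enc".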

Scott


-- 
Scott Kostyshak
Assistant Professor of Economics
University of Florida
https://people.clas.ufl.edu/skostyshak/

> >
> > I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
> > is my session info and locale info for my system with the 3.4.3 version:
> >
> >> sessionInfo()
> > R version 3.4.3 (2017-11-30)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu 16.04.3 LTS
> >
> > Matrix products: default
> > BLAS: /usr/lib/libblas/libblas.so.3.6.0
> > LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
> >
> > locale:
> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > loaded via a namespace (and not attached):
> > [1] compiler_3.4.3
> >
> >> Sys.getlocale()
> > [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> >
> > Thanks for your time,
> >
> > Scott
> >
> > P.S. Note that I had posted this question to r-devel, which was the
> > incorrect choice. For archival purposes, I reference the thread here:
> >
> > https://www.mail-archive.com/search?l=mid&q=20180501185750.445oub53vcdnyyyx%40steph
> >
> >
> > --
> > Scott Kostyshak
> > Assistant Professor of Economics
> > University of Florida
> > https://people.clas.ufl.edu/skostyshak/
> >
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 