[Rd] writeLines argument useBytes = TRUE still making conversions
istazahn at gmail.com
Thu Feb 15 18:16:59 CET 2018
On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinushey at gmail.com> wrote:
> I suspect your UTF-8 string is being stripped of its encoding before
> write, and so assumed to be in the system native encoding, and then
> re-encoded as UTF-8 when written to the file. You can see something
> similar with:
> > tmp <- 'é'
> > tmp <- iconv(tmp, to = 'UTF-8')
> > Encoding(tmp) <- "unknown"
> > charToRaw(iconv(tmp, to = "UTF-8"))
> [1] c3 83 c2 a9
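The result quoted above depends on the native encoding being a single-byte one such as latin1; in a UTF-8 locale the same four lines would most likely print c3 a9. As a rough sketch of the same double-encoding effect that should not depend on the locale, the bytes can be built by hand and the "from" encoding named explicitly. Here "latin1" only stands in for a non-UTF-8 native encoding, and the variable names are made up for illustration:

```r
# Build the two UTF-8 bytes of 'é' directly, so the example does not depend
# on how the script or the console is encoded.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))

# The correct UTF-8 representation: c3 a9
charToRaw(x)

# Pretend those bytes are latin1 (an assumption standing in for a non-UTF-8
# native encoding) and convert them to UTF-8 a second time.
twice <- iconv(x, from = "latin1", to = "UTF-8")

# Each original byte is now encoded as its own character: c3 83 c2 a9
charToRaw(twice)
```

The four bytes match the "Raw text as hex" value in the original report, which is consistent with the string having been encoded twice.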
> It's worth saying that:
> file(..., encoding = "UTF-8")
> means "attempt to re-encode strings as UTF-8 when writing to this
> file". However, if you already know your text is UTF-8, then you
> likely want to avoid opening a connection that might attempt to
> re-encode the input. Conversely (assuming I'm understanding the
> documentation correctly)
> file(..., encoding = "native.enc")
> means "assume that strings are in the native encoding, and hence
> translation is unnecessary". Note that it does not mean "attempt to
> translate strings to the native encoding".
If all that is true I think ?file needs some attention. I've read it
several times now and I just don't see how it can be interpreted as
you've described it.
> Also note that writeLines(..., useBytes = FALSE) will explicitly
> translate to the current encoding before sending bytes to the
> requested connection. In other words, there are two locations where
> translation might occur in your example:
> 1) In the call to writeLines(),
> 2) When characters are passed to the connection.
> In your case, it sounds like translation should be suppressed at both steps.
> I think this is documented correctly in ?writeLines (and also the
> Encoding section of ?file), but the behavior may feel unfamiliar at
> first glance.
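If I follow the two steps listed above, suppressing translation at both of them might look roughly like the sketch below: no encoding argument on the connection, and useBytes = TRUE in writeLines(). This is only my reading of the advice, not something taken from the documentation. It assumes getOption("encoding") is still at its default of "native.enc", and the binary open mode is my own choice so that the line ending is the same on every platform.

```r
# UTF-8 bytes of 'é', marked as UTF-8.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(x) <- "UTF-8"

path <- tempfile()

# Step 2: no 'encoding' argument, so the connection is not asked to
# re-encode anything (assumes the default option encoding = "native.enc").
# Binary mode keeps the separator as a plain LF on all platforms.
con <- file(path, open = "wb")

# Step 1: useBytes = TRUE tells writeLines() to pass the bytes through as-is.
writeLines(x, con = con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes: c3 a9 followed by the newline 0a.
bytes <- readBin(path, what = "raw", n = file.size(path))
bytes

unlink(path)
```

Opening and closing the connection explicitly also avoids leaving an unclosed connection behind, which the file(...) call passed inline in the original example would do.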
> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <davorj at live.com> wrote:
>> I think this behavior is inconsistent with the documentation:
>> tmp <- 'é'
>> tmp <- iconv(tmp, to = 'UTF-8')
>> tmpfilepath <- tempfile()
>> writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>> [1] "UTF-8"
>> [1] c3 a9
>> Raw text as hex: c3 83 c2 a9
>> If I switch to useBytes = FALSE, then the variable is written correctly as c3 a9.
>> Any thoughts? This behavior is related to this issue: https://github.com/yihui/knitr/issues/1509
>> R-devel at r-project.org mailing list