[Rd] writeLines argument useBytes = TRUE still making conversions
kevinushey at gmail.com
Sat Feb 17 23:24:11 CET 2018
Of course, right after writing this e-mail I tested on my Windows
machine and did not see what I expected:
 c3 a9
so obviously I'm misunderstanding something as well.
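
For what it's worth, a round trip through readLines() on the same connection can hide what is really in the file, so it may be safer to inspect the bytes directly. The following is only a sketch, not something taken from the thread: it assumes getOption("encoding") is still the default "native.enc", writes the UTF-8 bytes of 'é' with writeBin(), and reads them back with readBin(). It also shows the point from ?readLines quoted below, namely that the 'encoding' argument of readLines() only marks the result and does not re-encode it.

```r
path <- tempfile()

# Write the two UTF-8 bytes of 'é' plus a newline directly,
# bypassing any text-mode encoding machinery.
writeBin(as.raw(c(0xc3, 0xa9, 0x0a)), path)

# Inspect the file at the byte level, independent of the locale.
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)

# readLines(encoding = "UTF-8") marks the string as UTF-8;
# the bytes themselves are left untouched.
line <- readLines(path, encoding = "UTF-8")
print(Encoding(line))
print(charToRaw(line))
```

Comparing print(bytes) before and after a write is a locale-independent way to see whether a translation happened.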
On Sat, Feb 17, 2018 at 2:19 PM, Kevin Ushey <kevinushey at gmail.com> wrote:
> From my understanding, translation is implied in this line of ?file (from the
> Encoding section):
> The encoding of the input/output stream of a connection can be specified
> by name in the same way as it would be given to iconv: see that help page
> for how to find out what encoding names are recognized on your platform.
> Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is
> the internal encoding of the current locale and hence no translation is done.
> This is also hinted at in the documentation in ?readLines for its 'encoding'
> argument, which has a different semantic meaning from the 'encoding' argument
> as used with R connections:
> encoding to be assumed for input strings. It is used to mark character
> strings as known to be in Latin-1 or UTF-8: it is not used to re-encode
> the input. To do the latter, specify the encoding as part of the
> connection con or via options(encoding=): see the examples.
> It might be useful to augment the documentation in ?file with something like:
> The 'encoding' argument is used to request the translation of strings when
> writing to a connection.
> and, perhaps to further drive home the point about not translating when
> encoding = "native.enc":
> Note that R will not attempt translation of strings when encoding is
> either "" or "native.enc" (the default, as per getOption("encoding")).
> This implies that attempting to write, for example, UTF-8 encoded content
> to a connection opened using "native.enc" will retain its original UTF-8
> encoding -- it will not be translated.
> It is a bit surprising that 'native.enc' means "do not translate" rather than
> "attempt translation to the encoding associated with the current locale", but
> those are the semantics and they are not bound to change.
> This is the code I used to convince myself of that case:
> conn <- file(tempfile(), encoding = "native.enc", open = "w+")
> before <- iconv('é', to = "UTF-8")
> cat(before, file = conn, sep = "\n")
> after <- readLines(conn)
> with output:
> > charToRaw(before)
> [1] c3 a9
> > charToRaw(after)
> [1] c3 a9
> On Thu, Feb 15, 2018 at 9:16 AM, Ista Zahn <istazahn at gmail.com> wrote:
>> On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinushey at gmail.com> wrote:
>>> I suspect your UTF-8 string is being stripped of its encoding before
>>> write, and so assumed to be in the system native encoding, and then
>>> re-encoded as UTF-8 when written to the file. You can see something
>>> similar with:
>>> > tmp <- 'é'
>>> > tmp <- iconv(tmp, to = 'UTF-8')
>>> > Encoding(tmp) <- "unknown"
>>> > charToRaw(iconv(tmp, to = "UTF-8"))
>>> [1] c3 83 c2 a9
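
The quoted snippet only produces those four bytes when the native encoding is Latin-1 or similar (e.g. CP1252 on Windows). A locale-independent sketch of the same double encoding, assuming nothing beyond base R, is to tell iconv() explicitly that the UTF-8 bytes are Latin-1:

```r
x <- "\u00e9"                 # 'é' as a UTF-8 string
print(charToRaw(x))           # the two UTF-8 bytes of 'é'

# Treat the UTF-8 bytes as if they were Latin-1 and convert to UTF-8.
# Each original byte (c3, a9) becomes its own two-byte UTF-8 sequence.
double_encoded <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(double_encoded))
```

The second result, c3 83 c2 a9, is the same byte sequence reported in the original post.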
>>> It's worth saying that:
>>> file(..., encoding = "UTF-8")
>>> means "attempt to re-encode strings as UTF-8 when writing to this
>>> file". However, if you already know your text is UTF-8, then you
>>> likely want to avoid opening a connection that might attempt to
>>> re-encode the input. Conversely (assuming I'm understanding the
>>> documentation correctly)
>>> file(..., encoding = "native.enc")
>>> means "assume that strings are in the native encoding, and hence
>>> translation is unnecessary". Note that it does not mean "attempt to
>>> translate strings to the native encoding".
>> If all that is true I think ?file needs some attention. I've read it
>> several times now and I just don't see how it can be interpreted as
>> you've described it.
>>> Also note that writeLines(..., useBytes = FALSE) will explicitly
>>> translate to the current encoding before sending bytes to the
>>> requested connection. In other words, there are two locations where
>>> translation might occur in your example:
>>> 1) In the call to writeLines(),
>>> 2) When characters are passed to the connection.
>>> In your case, it sounds like translation should be suppressed at both steps.
>>> I think this is documented correctly in ?writeLines (and also the
>>> Encoding section of ?file), but the behavior may feel unfamiliar at
>>> first glance.
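
A minimal sketch of suppressing translation at both steps described above, assuming the text should end up on disk as UTF-8. The helper name write_utf8_lines is hypothetical (it is not part of base R or of this thread), and the connection is opened in binary mode so that line endings are not altered either:

```r
# Hypothetical helper: write text as UTF-8 bytes without any re-encoding.
write_utf8_lines <- function(text, path) {
  # Step 2: a "native.enc" connection does no translation of its own.
  con <- file(path, open = "wb", encoding = "native.enc")
  on.exit(close(con))
  # Step 1: convert to UTF-8 up front, then have writeLines() pass
  # the bytes through unchanged.
  writeLines(enc2utf8(text), con, sep = "\n", useBytes = TRUE)
}

path <- tempfile()
write_utf8_lines("\u00e9", path)

# Check the result at the byte level.
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)
```

The design choice here is to do the one intended conversion explicitly with enc2utf8() and leave both writeLines() and the connection out of the encoding business.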
>>> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <davorj at live.com> wrote:
>>>> I think this behavior is inconsistent with the documentation:
>>>> tmp <- 'é'
>>>> tmp <- iconv(tmp, to = 'UTF-8')
>>>> tmpfilepath <- tempfile()
>>>> writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>>>> [1] "UTF-8"
>>>> [1] c3 a9
>>>> Raw text as hex: c3 83 c2 a9
>>>> If I switch to useBytes = FALSE, then the variable is written correctly as c3 a9.
>>>> Any thoughts? This behavior is related to this issue: https://github.com/yihui/knitr/issues/1509
>>>> R-devel at r-project.org mailing list