[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
murdoch.duncan at gmail.com
Wed Feb 24 14:47:09 CET 2016
On 23/02/2016 7:06 AM, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>> > Dear R developers
>> > I think I have found a bug that can be reproduced with two lines of code
>> > and I am very thankful to get your first assessment or feed-back on my
>> > report.
>> > If this is the wrong mailing list or I did something wrong
>> > (e. g. semi "anonymous" email address to protect my privacy and defend
>> > unwanted spam) please let me know since I am new here.
>> > Thank you very much :-)
>> > J. Altfeld
>> Dear J.,
>> (yes, a bit less anonymity would be very welcomed here!),
>> You are right, this is a bug, at least in the documentation, but
>> probably "all real", indeed,
>> but read on.
>> > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>> >> If I execute the code from the "?write.table" examples section
>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>> >> # (omitted code)
>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>> >> the resulting CSV file has a size of 6 bytes which is too short
>> >> (truncated):
>> >> """,3
>> reproducibly, yes.
>> If you look at what write.csv does
>> and then simplify, you can get a similar wrong result by
>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>> which results in a file with one line
>> """ 3
>> and if you debug write.table() you see that its building blocks
>> here are
>> file <- file(........, encoding = fileEncoding)
>> a writeLines(*, file=file) for the column headers,
>> and then "deeper down" C code which I did not investigate.
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters. I think the obvious fix is to avoid the
> call to strlen() as the size is already known:
> Index: src/main/connections.c
> --- src/main/connections.c (revision 70213)
> +++ src/main/connections.c (working copy)
> @@ -369,7 +369,7 @@
> /* is this safe? */
> warning(_("invalid char string in output conversion"));
> *ob = '\0';
> - con->write(outbuf, 1, strlen(outbuf), con);
> + con->write(outbuf, 1, ob - outbuf, con);
> } while(again && inb > 0); /* it seems some iconv signal -1 on
> zero-length input */
> } else
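To see why the strlen() call truncates the output, here is a small R model of the two length computations. It is only an illustration, not code from connections.c: the raw vector stands in for outbuf after "a,b" has been converted to UTF-16LE, and the names strlen_like and actual_len are made up for this sketch.

```r
# Bytes of "a,b" in UTF-16LE: each ASCII character is followed by 0x00.
outbuf <- as.raw(c(0x61, 0x00, 0x2c, 0x00, 0x62, 0x00))

# What a C strlen() would report: the number of bytes before the first NUL.
strlen_like <- which(outbuf == as.raw(0))[1] - 1L

# What the patched line uses instead: the distance from the start of the
# buffer to the end of the converted data (ob - outbuf in the diff).
actual_len <- length(outbuf)

cat("strlen-style length:", strlen_like, "\n")
cat("actual length:      ", actual_len, "\n")
```

Under these assumptions the strlen-style count stops at the first character, while the pointer difference covers the whole converted buffer.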
>> But just looking a bit at such a file() object with writeLines()
>> seems slightly revealing, as e.g., 'eol' does not seem to
>> "work" for this encoding:
>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>> > close(ff)
>> > file.show(fn)
>> > file.size(fn)
>> [1] 5
> With the patch applied:
> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
> [1] "C" "B" "A" "|" ">a"
> > file.size(fn)
> [1] 22
That may be okay on Unix, but it's not enough on Windows. There the \n
that writeLines adds at the end of each line isn't translated to
UTF-16LE properly, so things get messed up. (I think the \n is
translated, but the \r that Windows wants is not, so you get a mix of 8
bit and 16 bit characters.)
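As a rough sketch of how the line-ending mix could be avoided from user code, assuming the description above is right: add the "\r\n" before conversion, so that both characters are encoded as UTF-16LE, and write the bytes through a binary connection. This is not the fix in R itself. The file name out and the variable names are hypothetical.

```r
lines <- c("C", "B", "A")

# Append CR LF to every line before converting, so that both characters
# are encoded as 16-bit units along with the text.
text <- paste0(lines, "\r\n", collapse = "")
bytes <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

# A binary connection writes the bytes unchanged (no newline translation).
out <- tempfile(fileext = ".txt")   # hypothetical output file
con <- file(out, open = "wb")
writeBin(bytes, con)
close(con)

# Read the file back as raw bytes to inspect what was written.
written <- readBin(out, what = "raw", n = file.size(out))
print(written)
```

Each line ends in the four bytes 0d 00 0a 00, so every character in the file is 16 bits wide. No byte order mark is written in this sketch.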
> - Mikko Korpela
>> >> The problem seems to be the iconv function:
>> >> iconv("foo", to="UTF-16")
>> >> produces
>> >> Error in iconv("foo", to = "UTF-16"):
>> >> embedded nul in string: '\xff\xfef\0o\0o\0'
>> but this works
>> > iconv("foo", to="UTF-16", toRaw=TRUE)
>> [1] ff fe 66 00 6f 00 6f 00
>> (indeed showing the embedded '\0's)
>> >> In 2010 a (partial) patch for this problem was submitted:
>> >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html
>> the patch only related to the iconv() problem not allowing 'raw'
>> (instead of character) argument x.
>> ... and it is > 5.5 years old, for an iconv() version that was less
>> featureful than today.
>> Rather, current iconv(x) allows x to be a list of raw entries.
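The two iconv() behaviours mentioned above (toRaw = TRUE on output, a list of raw vectors on input) can be tried with a short sketch. It uses "UTF-16LE" rather than "UTF-16", so no byte order mark is expected in the result. The variable names are only for illustration.

```r
# toRaw = TRUE returns a list with one raw vector per input string.
bytes <- iconv("foo", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(bytes)

# Count the embedded NUL bytes that a character result cannot hold.
nul_count <- sum(bytes == as.raw(0))

# iconv() also accepts a list of raw vectors as input, so the bytes
# can be converted back to an ordinary string.
roundtrip <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
print(roundtrip)
```

Half of the bytes are NULs, which is why the conversion cannot be returned as an R character string but works as raw.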
>> >> Are there chances to fix this problem since it prevents writing Windows
>> >> UTF-16LE text files?
>> >> PS: This problem can be reproduced on Windows and Linux.
>> indeed.... also on "R devel of today".
>> I agree it should be fixed... but as I said not by the patch you
>> Tested patches to fix this are welcome, indeed.