[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Mikko Korpela mikko.korpela at aalto.fi
Wed Feb 24 15:55:32 CET 2016


On 24.02.2016 15:47, Duncan Murdoch wrote:
> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>      on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>
>>>      > Dear R developers
>>>      > I think I have found a bug that can be reproduced with two lines of code
>>>      > and I am very thankful to get your first assessment or feed-back on my
>>>      > report.
>>>
>>>      > If this is the wrong mailing list or I did something wrong
>>>      > (e.g. semi "anonymous" email address to protect my privacy and defend
>>>      > against unwanted spam) please let me know since I am new here.
>>>
>>>      > Thank you very much :-)
>>>
>>>      > J. Altfeld
>>>
>>> Dear J.,
>>> (yes, a bit less anonymity would be very welcome here!),
>>>
>>> You are right, this is a bug, at least in the documentation, but
>>> probably "all real", indeed,
>>>
>>> but read on.
>>>
>>>      > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>      >>
>>>      >>
>>>      >> If I execute the code from the "?write.table" examples section
>>>      >>
>>>      >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>      >> # (omitted code)
>>>      >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>      >>
>>>      >> the resulting CSV file has a size of 6 bytes which is too short
>>>      >> (truncated):
>>>      >>
>>>      >> """,3
>>>
>>> reproducibly, yes.
>>> If you look at what write.csv does
>>> and then simplify, you can get a similar wrong result by
>>>
>>>    write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>
>>> which results in a file with one line
>>>
>>> """ 3
>>>
>>> and if you debug  write.table() you see that its building blocks
>>> here are
>>>      file <- file(........, encoding = fileEncoding)
>>>
>>> a      writeLines(*, file=file)  for the column headers,
>>>
>>> and then "deeper down" C code which I did not investigate.
>>
>> I took a look at connections.c. There is a call to strlen() that gets
>> confused by null characters. I think the obvious fix is to avoid the
>> call to strlen() as the size is already known:
>>
>> Index: src/main/connections.c
>> ===================================================================
>> --- src/main/connections.c    (revision 70213)
>> +++ src/main/connections.c    (working copy)
>> @@ -369,7 +369,7 @@
>>           /* is this safe? */
>>           warning(_("invalid char string in output conversion"));
>>           *ob = '\0';
>> -        con->write(outbuf, 1, strlen(outbuf), con);
>> +        con->write(outbuf, 1, ob - outbuf, con);
>>       } while(again && inb > 0);  /* it seems some iconv signal -1 on
>>                          zero-length input */
>>       } else
>>
>>
>>>
>>> But just looking a bit at such a file() object with writeLines()
>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>> "work" for this encoding:
>>>
>>>      > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>      > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>      > close(ff)
>>>      > file.show(fn)
>>>      CBA|>
>>>      > file.size(fn)
>>>      [1] 5
>>>      >
>>
>> With the patch applied:
>>
>>      > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>      [1] "C"  "B"  "A"  "|"  ">a"
>>      > file.size(fn)
>>      [1] 22
> 
> That may be okay on Unix, but it's not enough on Windows.  There the \n
> that writeLines adds at the end of each line isn't translated to
> UTF-16LE properly, so things get messed up.  (I think the \n is
> translated, but the \r that Windows wants is not, so you get a mix of 8
> bit and 16 bit characters.)

That's unfortunate. I tested my tiny patch on Linux. I don't know what
kind of additional changes would be needed to make this work on Windows.
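
To illustrate why strlen() is the wrong tool here, below is a minimal
standalone sketch. It is not code from connections.c. The helper names
(ascii_to_utf16le, len_by_strlen, len_by_cursor) are made up for this
example, and the encoder only handles plain ASCII input. The point is
that every ASCII character in UTF-16LE carries a zero high byte, so
strlen() stops after the first byte, while the distance between the
buffer start and the write cursor gives the real length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of R): encode ASCII text as UTF-16LE.
   Returns a pointer one past the last byte written, playing the role
   of the `ob` cursor in the patch. The caller must provide at least
   2 * strlen(ascii) + 1 bytes in `out`. */
static char *ascii_to_utf16le(const char *ascii, char *out)
{
    char *ob = out;
    for (const char *p = ascii; *p != '\0'; p++) {
        *ob++ = *p;    /* low byte: the ASCII code */
        *ob++ = '\0';  /* high byte: zero for every ASCII character */
    }
    *ob = '\0';        /* terminator, mirroring `*ob = '\0'` in the source */
    return ob;
}

/* Length as computed before the patch: stops at the first zero byte. */
static size_t len_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Length as computed with the patch: start of buffer to write cursor. */
static size_t len_by_cursor(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

For the input "C\n" the buffer holds the bytes 43 00 0A 00, so
len_by_strlen() reports 1 and len_by_cursor() reports 4. This is
consistent with the numbers above: one byte per write gives the 5-byte
file "CBA|>", whereas the 11 characters "C\nB\nA\n|\n>a\n" at two
bytes each give the 22 bytes seen with the patch.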

-- 
Mikko Korpela
Aalto University School of Science
Department of Computer Science


