[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Mikko Korpela mikko.korpela at aalto.fi
Mon Feb 29 21:30:41 CET 2016


The file.show() issue is now in the bug tracker. I used a slightly
different example to demonstrate the problem.

https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16738

- Mikko

On 29.02.2016 20:30, Duncan Murdoch wrote:
> I have just committed your first patch (the strlen() replacement) to
> R-devel, and will soon put it in R-patched as well.  I won't have time to
> look at this again before the 3.2.4 release, so your file.show() patch
> isn't going to make it unless someone else gets to it.
> 
> There's still a faint chance that I'll do more in R-devel before 3.3.0,
> but I think it's best if there were bug reports about both of these
> problems so they don't get forgotten.  Since the first one is mainly a
> Windows problem, I'll write that one up; I'd appreciate it if you could
> write up the file.show() issue, after checking against R-devel rev 70247
> or higher.
> 
> Duncan Murdoch
> 
> On 25/02/2016 5:54 AM, Mikko Korpela wrote:
>> On 25.02.2016 11:31, Mikko Korpela wrote:
>>> On 23.02.2016 14:06, Mikko Korpela wrote:
>>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>>>      on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>>
>>>>>      > Dear R developers
>>>>>      > I think I have found a bug that can be reproduced with two
>>>>>      > lines of code, and I would be very thankful for a first
>>>>>      > assessment or feedback on my report.
>>>>>
>>>>>      > If this is the wrong mailing list or I did something wrong
>>>>>      > (e.g. semi-"anonymous" email address to protect my privacy
>>>>>      > and defend against unwanted spam) please let me know since
>>>>>      > I am new here.
>>>>>
>>>>>      > Thank you very much :-)
>>>>>
>>>>>      > J. Altfeld
>>>>>
>>>>> Dear J.,
>>>>> (yes, a bit less anonymity would be very welcome here!),
>>>>>
>>>>> You are right, this is a bug, at least in the documentation, but
>>>>> probably "all real", indeed,
>>>>>
>>>>> but read on.
>>>>>
>>>>>      > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>>>      >>
>>>>>      >>
>>>>>      >> If I execute the code from the "?write.table" examples section
>>>>>      >>
>>>>>      >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>>>      >> # (omitted code)
>>>>>      >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>>>      >>
>>>>>      >> the resulting CSV file has a size of 6 bytes which is too
>>>>>      >> short (truncated):
>>>>>      >>
>>>>>      >> """,3
>>>>>
>>>>> reproducibly, yes.
>>>>> If you look at what write.csv does
>>>>> and then simplify, you can get a similar wrong result by
>>>>>
>>>>>    write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>>
>>>>> which results in a file with one line
>>>>>
>>>>> """ 3
>>>>>
>>>>> and if you debug  write.table() you see that its building blocks
>>>>> here are
>>>>>      file <- file(........, encoding = fileEncoding)
>>>>>
>>>>> a      writeLines(*, file=file)  for the column headers,
>>>>>
>>>>> and then "deeper down" C code which I did not investigate.
>>>>
>>>> I took a look at connections.c. There is a call to strlen() that gets
>>>> confused by null characters. I think the obvious fix is to avoid the
>>>> call to strlen() as the size is already known:
>>>>
>>>> Index: src/main/connections.c
>>>> ===================================================================
>>>> --- src/main/connections.c    (revision 70213)
>>>> +++ src/main/connections.c    (working copy)
>>>> @@ -369,7 +369,7 @@
>>>>           /* is this safe? */
>>>>           warning(_("invalid char string in output conversion"));
>>>>           *ob = '\0';
>>>> -        con->write(outbuf, 1, strlen(outbuf), con);
>>>> +        con->write(outbuf, 1, ob - outbuf, con);
>>>>       } while(again && inb > 0);  /* it seems some iconv signal -1 on
>>>>                          zero-length input */
>>>>       } else
>>>>
>>>>
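To illustrate why strlen() undercounts a UTF-16 output buffer, here is a minimal standalone C sketch. It is not part of the patch: the function names and the hard-coded sample bytes are assumptions chosen for illustration, and only the variable names outbuf and ob follow the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill outbuf with "C\n" encoded as UTF-16LE and
   return a pointer one past the last byte written, the way iconv()
   advances its output pointer. Each ASCII character is followed by a
   zero byte, so the buffer contains embedded nulls. */
char *fill_utf16le_sample(char *outbuf)
{
    static const char sample[] = { 'C', 0x00, '\n', 0x00 };

    assert(outbuf != NULL);
    memcpy(outbuf, sample, sizeof sample);
    return outbuf + sizeof sample;
}

/* Byte count as computed before the patch: strlen() stops at the
   first zero byte, which here is the high byte of 'C'. */
size_t length_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Byte count as computed after the patch: the distance between the
   start of the buffer and the current output pointer. */
size_t length_by_pointer(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

With a 16-byte buffer buf, ob = fill_utf16le_sample(buf) followed by *ob = '\0', length_by_strlen(buf) gives 1 while length_by_pointer(buf, ob) gives 4. That difference is consistent with the truncated files reported earlier in the thread.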
>>>>>
>>>>> But just looking a bit at such a file() object with writeLines()
>>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>>> "work" for this encoding:
>>>>>
>>>>>      > fn <- tempfile("ffoo")
>>>>>      > ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>>>      > writeLines(LETTERS[3:1], ff); writeLines("|", ff)
>>>>>      > writeLines(">a", ff)
>>>>>      > close(ff)
>>>>>      > file.show(fn)
>>>>>      CBA|>
>>>>>      > file.size(fn)
>>>>>      [1] 5
>>>>>      >
>>>>
>>>> With the patch applied:
>>>>
>>>>      > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>>>      [1] "C"  "B"  "A"  "|"  ">a"
>>>>      > file.size(fn)
>>>>      [1] 22
>>> I just realized that I was misusing the encoding argument of
>>> readLines(). The code above works by accident, but the following would
>>> be more appropriate:
>>>
>>>      > ff <- file(fn, open="r", encoding="UTF-16LE")
>>>      > readLines(ff)
>>>      [1] "C"  "B"  "A"  "|"  ">a"
>>>      > close(ff)
>>>
>>> Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
>>> the patch is incomplete on Windows.)
>> Before inspecting the file with readLines() I tried file.show() but it
>> did not work as expected. On Linux using a UTF-8 locale, the result of
>> trying to show the truly UTF-16LE encoded file with
>>
>>      > file.show(fn, encoding="UTF-16LE")
>>
>> was a pager showing "<43>" (quotes not included) followed by several
>> empty lines.
>>
>> With the following patch, the command works correctly (in this case, on
>> this platform, not tested comprehensively). The idea is to read the
>> input file "raw" in order to avoid problems with null characters. The
>> input then needs to be split into lines after iconv(), or it could be
>> written to the output file with cat() if the style of line termination
>> characters does not matter. The 'perl = TRUE' is for an assumed
>> performance advantage only. It can be removed, or one might want to test
>> whether there is a significant difference one way or the other.
>>
>> - Mikko
>>
>> Index: src/library/base/R/files.R
>> ===================================================================
>> --- src/library/base/R/files.R    (revision 70217)
>> +++ src/library/base/R/files.R    (working copy)
>> @@ -50,10 +50,13 @@
>>           for(i in seq_along(files)) {
>>               f <- files[i]
>>               tf <- tempfile()
>> -            tmp <- readLines(f, warn = FALSE)
>> +            tmp <- list(readBin(f, "raw", file.size(f)))
>>               tmp2 <- try(iconv(tmp, encoding, "", "byte"))
>>               if(inherits(tmp2, "try-error")) file.copy(f, tf)
>> -            else writeLines(tmp2, tf)
>> +            else {
>> +                tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
>> +                writeLines(tmp2, tf)
>> +            }
>>               files[i] <- tf
>>               if(delete.file) unlink(f)
>>           }
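The line-splitting step in the file.show() patch above, the strsplit() call with the pattern "\r\n?|\n", can be sketched in C to match the connections.c patch earlier in the thread. This is only an illustration of the splitting rule, not code from R: the function name split_lines and its signature are hypothetical, and it assumes the text has already been converted to a NUL-terminated string without embedded nulls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split text in place on "\r\n", a lone "\r", or
   a lone "\n". Each terminator is overwritten with '\0' and a pointer
   to the start of every line is stored in lines[]. Returns the number
   of lines found, at most max_lines. As with strsplit(), a terminator
   at the very end does not produce a trailing empty line. */
size_t split_lines(char *text, char **lines, size_t max_lines)
{
    size_t n = 0;
    char *p = text;

    assert(text != NULL && lines != NULL);

    while (*p != '\0' && n < max_lines) {
        lines[n++] = p;            /* start of the current line */
        p += strcspn(p, "\r\n");   /* skip to the next terminator or the end */
        if (*p == '\0')
            break;                 /* last line had no terminator */
        if (p[0] == '\r' && p[1] == '\n') {
            *p = '\0';
            p += 2;                /* "\r\n" counts as a single terminator */
        } else {
            *p = '\0';
            p += 1;                /* lone "\r" or lone "\n" */
        }
    }
    return n;
}
```

Called on a writable copy of "C\nB\r\nA\r|" with room for eight lines, split_lines() returns 4 and leaves the lines "C", "B", "A" and "|", whichever terminator style separated them.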


