[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

nospam at altfeld-im.de nospam at altfeld-im.de
Tue Feb 23 22:53:53 CET 2016


Excellent analysis, thank you both for the quick reply!

Is there anything I can do to get the bug fixed in the next version of R
(e.g. by filing a bug report at https://bugs.r-project.org/bugzilla3/)?


On Tue, 2016-02-23 at 14:06 +0200, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
> >>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
> >>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:
> > 
> >     > Dear R developers
> >     > I think I have found a bug that can be reproduced with two lines of code
> >     > and I would be very thankful for your first assessment or feedback on my
> >     > report.
> > 
> >     > If this is the wrong mailing list or I did something wrong
> >     > (e.g. a semi-"anonymous" email address to protect my privacy and guard
> >     > against unwanted spam), please let me know since I am new here.
> > 
> >     > Thank you very much :-)
> > 
> >     > J. Altfeld
> > 
> > Dear J.,
> > (yes, a bit less anonymity would be very welcome here!),
> > 
> > You are right, this is a bug, at least in the documentation, but
> > probably "all real", indeed,
> > 
> > but read on.
> > 
> >     > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
> >     >> 
> >     >> 
> >     >> If I execute the code from the "?write.table" examples section
> >     >> 
> >     >> x <- data.frame(a = I("a \" quote"), b = pi)
> >     >> # (omitted code)
> >     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
> >     >> 
> >     >> the resulting CSV file has a size of 6 bytes which is too short
> >     >> (truncated):
> >     >> 
> >     >> """,3
> > 
> > reproducibly, yes.
> > If you look at what write.csv does
> > and then simplify, you can get a similar wrong result by
> > 
> >   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
> > 
> > which results in a file with one line
> > 
> > """ 3
> > 
> > and if you debug  write.table() you see that its building blocks
> > here are
> > 	 file <- file(........, encoding = fileEncoding)
> > 
> > a 	 writeLines(*, con = file)  for the column headers,
> > 
> > and then "deeper down" C code which I did not investigate.
> 
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters. I think the obvious fix is to avoid the
> call to strlen() as the size is already known:
> 
> Index: src/main/connections.c
> ===================================================================
> --- src/main/connections.c	(revision 70213)
> +++ src/main/connections.c	(working copy)
> @@ -369,7 +369,7 @@
>  		/* is this safe? */
>  		warning(_("invalid char string in output conversion"));
>  	    *ob = '\0';
> -	    con->write(outbuf, 1, strlen(outbuf), con);
> +	    con->write(outbuf, 1, ob - outbuf, con);
>  	} while(again && inb > 0);  /* it seems some iconv signal -1 on
>  				       zero-length input */
>      } else
> 
> 
> > 
> > But just looking a bit at such a file() object with writeLines()
> > seems slightly revealing, as e.g., 'eol' does not seem to
> > "work" for this encoding:
> > 
> >     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
> >     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
> >     > close(ff)
> >     > file.show(fn)
> >     CBA|>
> >     > file.size(fn)
> >     [1] 5
> >     > 
> 
> With the patch applied:
> 
>     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>     [1] "C"  "B"  "A"  "|"  ">a"
>     > file.size(fn)
>     [1] 22
> 
> - Mikko Korpela
> 
> >     >> The problem seems to be the iconv function:
> >     >> 
> >     >> iconv("foo", to="UTF-16")
> >     >> 
> >     >> produces
> >     >> 
> >     >> Error in iconv("foo", to = "UTF-16"):
> >     >> embedded nul in string: '\xff\xfef\0o\0o\0'
> > 
> > but this works
> > 
> >     > iconv("foo", to="UTF-16", toRaw=TRUE)
> >     [[1]]
> >     [1] ff fe 66 00 6f 00 6f 00
> > 
> > (indeed showing the embedded '\0's)
> > 
> >     >> In 2010 a (partial) patch for this problem was submitted:
> >     >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html
> > 
> > the patch only related to the iconv() problem not allowing 'raw'
> > (instead of character) argument x.
> > 
> > ... and it is > 5.5 years old, for an iconv() version that was less
> > featureful than today.
> > Rather, current iconv(x) allows x to be a list of raw entries.
> > 
> > 
> >     >> Are there chances to fix this problem since it prevents writing Windows
> >     >> UTF-16LE text files?
> > 
> >     >> 
> >     >> PS: This problem can be reproduced on Windows and Linux.
> > 
> > indeed.... also on "R devel of today".
> > 
> > I agree it should be fixed... but as I said not by the patch you
> > mentioned.
> > 
> > Tested patches to fix this are welcome, indeed.
>
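
For anyone following the strlen() point in the connections.c patch quoted above, here is a minimal standalone C sketch of why the byte count matters. It is not taken from the R sources: the helper ascii_to_utf16le() and the two length functions are hypothetical names made up for illustration, and the hand-rolled encoder only handles ASCII input. Only the names outbuf and ob mirror the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of R): encode an ASCII string as UTF-16LE
   into out, write a trailing '\0' as the quoted code does (*ob = '\0'),
   and return a pointer one past the last encoded byte. */
char *ascii_to_utf16le(const char *in, char *out)
{
    char *ob = out;
    for (; *in; in++) {
        *ob++ = *in;   /* low byte of the code unit */
        *ob++ = '\0';  /* high byte is zero for ASCII characters */
    }
    *ob = '\0';
    return ob;
}

/* Length as the unpatched line computes it: stops at the first zero byte. */
size_t len_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Length as the patched line computes it: the number of bytes actually
   written into the buffer, regardless of embedded zeros. */
size_t len_by_pointer(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

For the input "CBA" the buffer holds the six bytes 43 00 42 00 41 00, so len_by_strlen() gives 1 while len_by_pointer() gives 6. If each string is converted and written in its own pass, one byte per pass would be consistent with the 5-byte "CBA|>" file shown above.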


