[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Martin Maechler maechler at stat.math.ethz.ch
Tue Feb 23 10:37:38 CET 2016


>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:

    > Dear R developers
    > I think I have found a bug that can be reproduced with two lines of code
    > and I am very thankful to get your first assessment or feed-back on my
    > report.

    > If this is the wrong mailing list or I did something wrong
    > (e. g. semi "anonymous" email address to protect my privacy and defend
    > unwanted spam) please let me know since I am new here.

    > Thank you very much :-)

    > J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcome here!),

You are right, this is a bug: at least in the documentation, but
probably a real bug in the code as well.

But read on.

    > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
    >> 
    >> 
    >> If I execute the code from the "?write.table" examples section
    >> 
    >> x <- data.frame(a = I("a \" quote"), b = pi)
    >> # (omitted code)
    >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
    >> 
    >> the resulting CSV file has a size of 6 bytes which is too short
    >> (truncated):
    >> 
    >> """,3

reproducibly, yes.
If you look at what write.csv does
and then simplify, you can get a similar wrong result by

  write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

""" 3

and if you debug write.table() you see that its building blocks
here are

	 file <- file(........, encoding = fileEncoding)

then a

	 writeLines(*, con = file)

for the column headers, and then, "deeper down", C code which I did not
investigate.

But just experimenting a bit with such a file() connection and
writeLines() is somewhat revealing: e.g., the line terminator
('eol' in write.table(), 'sep' in writeLines()) does not seem to
"work" for this encoding:

    > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
    > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
    > close(ff)
    > file.show(fn)
    CBA|>
    > file.size(fn)
    [1] 5
    > 
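
To see exactly which bytes end up in such a file, one can compare them
with what iconv(*, toRaw = TRUE) produces for the same text. This is only
a sketch: the names 'written', 'expected' and 'txt' are made up here, and
the outcome of the comparison depends on the R version and platform.

```r
# Write a few lines through a UTF-16LE connection, as in the session above
fn <- tempfile("ffoo")
ff <- file(fn, open = "w", encoding = "UTF-16LE")
writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
close(ff)

# Bytes that actually landed in the file
written <- readBin(fn, what = "raw", n = file.size(fn))

# Bytes one would expect: the same text, newline-terminated, in UTF-16LE
txt <- paste0(c(LETTERS[3:1], "|", ">a"), "\n", collapse = "")
expected <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

print(written)
print(expected)

# A difference means the file is not the plain UTF-16LE encoding of 'txt'
identical(written, expected)
```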

    >> The problem seems to be the iconv function:
    >> 
    >> iconv("foo", to="UTF-16")
    >> 
    >> produces
    >> 
    >> Error in iconv("foo", to = "UTF-16"):
    >> embedded nul in string: '\xff\xfef\0o\0o\0'

but this works

    > iconv("foo", to="UTF-16", toRaw=TRUE)
    [[1]]
    [1] ff fe 66 00 6f 00 6f 00

(indeed showing the embedded '\0's)
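
Building on this, a possible workaround (a sketch only, not a fix for
write.table() itself) is to produce the CSV text in memory, convert it
with toRaw = TRUE, and write the bytes with writeBin(). The variable
names and the "\r\n" line ending are assumptions made for illustration.

```r
x <- data.frame(a = I("a \" quote"), b = pi)

# Render the CSV as character lines instead of writing to a file
csv_lines <- capture.output(write.csv(x))

# One string with explicit (Windows-style) line endings, marked as UTF-8
txt <- enc2utf8(paste0(csv_lines, "\r\n", collapse = ""))

# Convert to UTF-16LE bytes; toRaw = TRUE avoids the "embedded nul" error
bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

# Write the bytes unchanged; writeBin() does no re-encoding
out <- tempfile(fileext = ".csv")
writeBin(bytes, out)

# The file should be exactly as long as the converted byte vector
file.size(out) == length(bytes)
```

Since the conversion happens before anything touches a connection, the
embedded '\0's never pass through a character string.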

    >> In 2010 a (partial) patch for this problem was submitted:
    >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

The patch only addressed the fact that iconv() did not accept a 'raw'
(instead of character) argument x.

... and it is more than 5.5 years old, written for an iconv() version
that had fewer features than today's.
In the meantime, current iconv(x) allows x to be a list of raw vectors.
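
A minimal illustration of the list-of-raw interface (the name 'foo_raw'
is made up for this example):

```r
# UTF-16LE bytes of "foo", i.e. the bytes shown above without the ff fe
# byte order mark
foo_raw <- as.raw(c(0x66, 0x00, 0x6f, 0x00, 0x6f, 0x00))

# x may be a list of raw vectors; each element is converted separately
iconv(list(foo_raw), from = "UTF-16LE", to = "UTF-8")
#> [1] "foo"
```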


    >> Are there chances to fix this problem since it prevents writing Windows
    >> UTF-16LE text files?

    >> 
    >> PS: This problem can be reproduced on Windows and Linux.

indeed.... also on "R devel of today".

I agree it should be fixed... but as I said not by the patch you
mentioned.

Tested patches to fix this are welcome, indeed.

Martin Maechler



    >> ---------------
    >> 
    >> > sessionInfo()
    >> R version 3.2.3 (2015-12-10)
    >> Platform: x86_64-pc-linux-gnu (64-bit)
    >> Running under: Ubuntu 14.04.3 LTS
    >> 
    >> locale:
    >> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    >> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
    >> [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    >> 
    >> attached base packages:
    >> [1] stats     graphics  grDevices utils     datasets  methods   base
    >> 
    >> loaded via a namespace (and not attached):
    >> [1] tools_3.2.3
    >> >
    >> 
    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel
