[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

Duncan Murdoch murdoch.duncan at gmail.com
Mon May 1 20:21:59 CEST 2017


On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
> No, I don't think anyone is working on this.
>
> There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
> issues:  don't attempt to produce character vectors, produce raw vectors
> instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
> can contain embedded nulls.  Character vectors can't, because
> internally, R is using 8-bit C strings, and the nulls are string
> terminators.
>
> I don't know how difficult it would be to fix the write.table problems.

I've now taken a look, and it appears as if it's not too hard.  I'll see 
if I can work out a patch that I trust.
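
In the meantime, a minimal sketch of the raw-vector workaround quoted
above might look like the following. The helper name write_csv_utf16le
is hypothetical (it is not part of R). The sketch assumes that letting
write.csv produce UTF-8 first and converting afterwards is acceptable,
and that no field contains an embedded newline.

```r
# Sketch of the raw-vector workaround; write_csv_utf16le is a
# hypothetical helper name, not an R function.

# iconv() with toRaw = TRUE returns a list of raw vectors, so the
# embedded nuls of UTF-16 are not a problem.
nl <- iconv("\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(nl)

write_csv_utf16le <- function(df, path) {
  # Let write.csv produce UTF-8 text in a temporary file first.
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  write.csv(df, tmp, row.names = FALSE, fileEncoding = "UTF-8")

  # Re-read the lines and join them with explicit CRLF line endings.
  lines <- readLines(tmp, encoding = "UTF-8")
  txt <- paste0(lines, "\r\n", collapse = "")

  # Convert to UTF-16LE bytes and write them in binary mode, so that
  # nothing rewrites the line endings afterwards.
  bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  writeBin(bytes, path)
  invisible(bytes)
}

out <- tempfile(fileext = ".csv")
bytes <- write_csv_utf16le(data.frame(a = "x"), out)
print(readBin(out, "raw", 1000))
```

Because the bytes are written with writeBin(), each CR and LF should
keep its nul byte. No byte-order mark is written; whether one is needed
depends on the program reading the file.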

Duncan Murdoch

>
> Duncan Murdoch
>
> On 29/04/2017 7:53 PM, Jack Kelley wrote:
>> "R version 3.4.0 (2017-04-21)"  on "x86_64-w64-mingw32" platform
>>
>> I am using CSVs and other text tables, and text in general (including
>> regular expressions), on Windows 10.
>> For me, that means dealing with Windows-1252 and UTF-8 encoding, with UTF-16
>> and UTF-32 as helpful curiosities.
>>
>> Something as simple as iconv ("\n", to = "UTF-16") causes an error, due to
>> an embedded nul.
>>
>> Then there is write.csv (or write.table) with its fileEncoding parameter:
>> not working correctly for UTF-16 and UTF-32.
>>
>> Of course, developers are aware of this, for example …
>>
>> [Rd] iconv to UTF-16 encoding produces error due to embedded nulls
>> (write.table with fileEncoding param)
>> https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html
>>
>> iconv to UTF-16 encoding produces error due to embedded nulls (write.table
>> with fileEncoding param)
>> http://r.789695.n4.nabble.com/iconv-to-UTF-16-encoding-produces-error-due-to-embedded-nulls-write-table-with-fileEncoding-param-td4717481.html
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> Focussing on write.csv and UTF-16LE and UTF-16BE, it seems that a nul
>> character is omitted in each <CarriageReturn><LineFeed> pair.
>>
>> TEST SCRIPT
>> ----------------------------------------------------------------------------------------------------
>> remove (list = objects())
>>
>> print (sessionInfo())
>> cat ("---------------------------------\n\n")
>>
>> LE <- data.frame (
>>   want = c ("0d,00", "0a,00"),
>>   got  = c ("0d   ", "0a,00")
>> )
>>
>> BE <- data.frame (
>>   want = c ("00,0d", "00,0a"),
>>   got  = c ("00,0d", "   0a")
>> )
>>
>> write.csv (LE, "R_LE.csv", fileEncoding = "UTF-16LE", row.names = FALSE)
>> write.csv (BE, "R_BE.csv", fileEncoding = "UTF-16BE", row.names = FALSE)
>>
>> print (readBin ("R_LE.csv", "raw", 1000))
>> print (LE)
>> cat ("\n")
>>
>> print (readBin ("R_BE.csv", "raw", 1000))
>> print (BE)
>> cat ("\n")
>>
>> try (iconv ("\n", to = "UTF-8"))
>>
>> try (iconv ("\n", to = "UTF-16LE"))
>> try (iconv ("\n", to = "UTF-16BE"))
>> try (iconv ("\n", to = "UTF-16"))
>>
>> try (iconv ("\n", to = "UTF-32LE"))
>> try (iconv ("\n", to = "UTF-32BE"))
>> try (iconv ("\n", to = "UTF-32"))
>> ----------------------------------------------------------------------------------------------------
>>
>> TEST SCRIPT OUTPUT
>>
>> > source ("bug_encoding.R")
>> R version 3.4.0 (2017-04-21)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> Running under: Windows 10 x64 (build 14393)
>>
>> Matrix products: default
>>
>> locale:
>> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
>> [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
>> [5] LC_TIME=English_Australia.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_3.4.0
>> ---------------------------------
>>
>>  [1] 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00 0d
>> [26] 0a 00 22 00 30 00 64 00 2c 00 30 00 30 00 22 00 2c 00 22 00 30 00 64 00 20
>> [51] 00 20 00 20 00 22 00 0d 0a 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 2c
>> [76] 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 0d 0a 00
>>    want   got
>> 1 0d,00 0d
>> 2 0a,00 0a,00
>>
>>  [1] 00 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00
>> [26] 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 64 00 22 00 2c 00 22 00 30 00 30 00
>> [51] 2c 00 30 00 64 00 22 00 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 61 00 22 00
>> [76] 2c 00 22 00 20 00 20 00 20 00 30 00 61 00 22 00 0d 0a
>>    want   got
>> 1 00,0d 00,0d
>> 2 00,0a    0a
>>
>> Error in iconv("\n", to = "UTF-16LE") : embedded nul in string: '\n\0'
>> Error in iconv("\n", to = "UTF-16BE") : embedded nul in string: '\0\n'
>> Error in iconv("\n", to = "UTF-16") : embedded nul in string: 'þÿ\0\n'
>> Error in iconv("\n", to = "UTF-32LE") :
>>   embedded nul in string: '\n\0\0\0'
>> Error in iconv("\n", to = "UTF-32BE") :
>>   embedded nul in string: '\0\0\0\n'
>> Error in iconv("\n", to = "UTF-32") :
>>   embedded nul in string: '\0\0þÿ\0\0\0\n'
>> >
>> ----------------------------------------------------------------------------------------------------
>> Cheers -- Jack Kelley
>>
>