[R] Can't read table encoded in Unicode (R-2.8.1)

Hilmar Berger hilmar.berger at gmx.de
Sat Apr 18 22:40:11 CEST 2009


Hi Duncan,
Thanks, this solves my problem.
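
A minimal sketch of that fix, assuming R >= 2.9.0 (where read.table has the fileEncoding argument). The temporary file built here is only a stand-in for the OpenOffice export, and the names txt, bytes, path and tab are made up for illustration:

```r
# Stand-in for the exported table: tab-separated, decimal commas
# (LF line endings here for simplicity).
txt <- paste("a\tb\tc\td\n",
             "X\t1,2\t1,3\t1,4\n",
             "Y\t2,2\t2,3\t2,4\n",
             "Z\t3,2\t3,3\t3,4\n", sep = "")

# Convert to UTF-16LE bytes and prepend the BOM (ff fe) seen in the hexdump.
bytes <- c(as.raw(c(0xff, 0xfe)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]])
path <- tempfile()
writeBin(bytes, path)

# fileEncoding re-encodes the file while it is read, so read.table
# sees ordinary text instead of the raw UTF-16 bytes.
tab <- read.table(path, header = TRUE, sep = "\t", dec = ",",
                  fileEncoding = "UTF-16")
str(tab)
```

With a real export, only the read.table call is needed, with the file name in place of path.
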
Regards, Hilmar
 
Duncan Murdoch wrote:
> On 18/04/2009 1:18 PM, Hilmar Berger wrote:
>> Hi all,
>>
>> I have problems reading Unicode (UTF-16) coded tables in R 2.8.1 
>> under Windows Vista.
>>
>> Imagine the following table:
>>
>> a    b    c    d
>> X    1,2    1,3    1,4
>> Y    2,2    2,3    2,4
>> Z    3,2    3,3    3,4
>>
>> Usually I would use the following code to read the table:
>>
>> t = read.table("test.txt", header=T, sep="\t",dec=",")
>>
>> This works well if I create the table using Notepad (the text will 
>> then be in UTF-8 or ASCII).
>
> I haven't tried 2.8.1 (which is obsolete, since yesterday :-), but in 
> 2.9.0 it works fine if I use the fileEncoding argument to read.table.
>
> Duncan Murdoch
>
>
>> However, if I use e.g. OpenOffice scalc to create a spreadsheet 
>> holding the same data and save this data as text (using tabs as 
>> separators, no quotes and using Unicode encoding), the command above 
>> gives this:
>>
>>  > t = read.table("test.csv", header=T, sep="\t",dec=",")
>>  > t
>>   ÿþa
>> 1  NA
>> 2  NA
>> 3  NA
>>
>> I tried to play with the "encoding" parameter, but that did not 
>> change anything.
>>
>> The file from OpenOffice is in UTF-16, as shown by hexdump:
>> $ hexdump test.csv
>> 0000000 feff 0061 0009 0062 0009 0063 0009 0064
>> 0000010 000d 000a 0058 0009 0031 002c 0032 0009
>> 0000020 0031 002c 0033 0009 0031 002c 0034 000d
>> 0000030 000a 0059 0009 0032 002c 0032 0009 0032
>> 0000040 002c 0033 0009 0032 002c 0034 000d 000a
>> 0000050 005a 0009 0033 002c 0032 0009 0033 002c
>> 0000060 0033 0009 0033 002c 0034 000d 000a
>> 000006e
>>
>> I tried to read the file using file/readLines, which seemed to work 
>> after specifying the encoding:
>>
>>  > a = file("test.csv",open="r", encoding="UTF-16")
>>  > b = readLines(a)
>>  > b
>> [1] "a\tb\tc\td"       "X\t1,2\t1,3\t1,4" "Y\t2,2\t2,3\t2,4" 
>> "Z\t3,2\t3,3\t3,4"
>>
>> Looking at the code of readtable.R in R-2.8.1 and R-2.9.0, it seems 
>> that the encoding does not get passed through in the second call to 
>> scan() appearing in the code.
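
For versions without a fileEncoding argument, the readLines approach quoted above can probably be taken one step further by parsing the already decoded lines from a text connection. This is only a sketch and has not been checked on 2.8.1; the temporary file is a stand-in for test.csv, and the variable names are made up for illustration:

```r
# Stand-in for test.csv: the same table as UTF-16LE with a BOM.
txt <- paste("a\tb\tc\td\n",
             "X\t1,2\t1,3\t1,4\n",
             "Y\t2,2\t2,3\t2,4\n",
             "Z\t3,2\t3,3\t3,4\n", sep = "")
bytes <- c(as.raw(c(0xff, 0xfe)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]])
path <- tempfile()
writeBin(bytes, path)

# Decode while reading the lines, as in the file/readLines session.
con <- file(path, open = "r", encoding = "UTF-16")
lines <- readLines(con)
close(con)

# Parse the decoded lines from an in-memory text connection, so
# read.table never touches the UTF-16 bytes itself.
tc <- textConnection(lines)
tab <- read.table(tc, header = TRUE, sep = "\t", dec = ",")
close(tc)
str(tab)
```
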
>>
>> I'm not sure if this is a bug or if I'm doing something wrong here.
>>
>> Regards,
>> Hilmar
>>
>> ------------------
>> My system  and R settings are:
>>
>>  > sessionInfo()
>> R version 2.8.1 (2008-12-22)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252 
>>
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base   
>> loaded via a namespace (and not attached):
>> [1] tools_2.8.1
>>
>>  > Sys.info()
>>                      sysname                      
>> release                      version                     nodename
>>                    "Windows"                      "Vista" "build 
>> 6001, Service Pack 1"                  "PC"
>>                      machine                        
>> login                         user
>>                        "x86" 
>>  > options("encoding")
>> $encoding
>> [1] "native.enc"
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>



