[R] Can't read table encoded in Unicode (R-2.8.1)

Hilmar Berger hilmar.berger at gmx.de
Sat Apr 18 19:18:06 CEST 2009


Hi all,

I have problems reading Unicode (UTF-16) encoded tables in R 2.8.1 under 
Windows Vista.

Imagine the following table:

a    b    c    d
X    1,2    1,3    1,4
Y    2,2    2,3    2,4
Z    3,2    3,3    3,4

Usually I would use the following code to read the table:

t = read.table("test.txt", header=T, sep="\t",dec=",")

This works well if I create the table using Notepad (the text will then be 
in UTF-8 or ASCII).

However, if I use e.g. OpenOffice Calc (scalc) to create a spreadsheet holding 
the same data and save it as text (using tabs as separators, no 
quotes, and Unicode encoding), the command above gives this:

 > t = read.table("test.csv", header=T, sep="\t",dec=",")
 > t
  ÿþa
1  NA
2  NA
3  NA

I tried different values for the "encoding" parameter, but that did not 
change anything.

The file from OpenOffice is in UTF-16, as shown by hexdump:
$ hexdump test.csv
0000000 feff 0061 0009 0062 0009 0063 0009 0064
0000010 000d 000a 0058 0009 0031 002c 0032 0009
0000020 0031 002c 0033 0009 0031 002c 0034 000d
0000030 000a 0059 0009 0032 002c 0032 0009 0032
0000040 002c 0033 0009 0032 002c 0034 000d 000a
0000050 005a 0009 0033 002c 0032 0009 0033 002c
0000060 0033 0009 0033 002c 0034 000d 000a
000006e
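
A note on reading the dump: by default hexdump prints 16-bit words in host 
byte order, so on a little-endian machine "feff" corresponds to the bytes 
FF FE on disk, i.e. a little-endian UTF-16 byte order mark. Those two bytes 
are what appears as "ÿþ" when they are interpreted as Latin-1/CP1252. The 
same check can be sketched from within R. The temporary file and the names 
path, txt, bom and is_utf16le below are made up for illustration and only 
stand in for test.csv:

```r
# Hypothetical stand-in for test.csv: a small UTF-16LE file with a BOM.
# For pure ASCII text, UTF-16LE is each byte followed by a zero byte.
path <- tempfile(fileext = ".csv")
txt <- "a\tb\r\nX\t1,2\r\n"
bytes <- c(as.raw(c(0xff, 0xfe)),               # byte order mark FF FE
           as.raw(rbind(utf8ToInt(txt), 0L)))   # interleave zero bytes
writeBin(bytes, path)

# Read the first two bytes as they are stored on disk.
bom <- readBin(path, what = "raw", n = 2)
print(bom)

# TRUE if the file starts with the little-endian UTF-16 BOM.
is_utf16le <- identical(bom, as.raw(c(0xff, 0xfe)))
```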

I tried to read the file using file/readLines, which seemed to work 
after specifying the encoding:

 > a = file("test.csv",open="r", encoding="UTF-16")
 > b = readLines(a)
 > b
[1] "a\tb\tc\td"       "X\t1,2\t1,3\t1,4" "Y\t2,2\t2,3\t2,4" "Z\t3,2\t3,3\t3,4"
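
Building on the readLines result, one possible workaround (a sketch, not 
necessarily the intended way) is to hand the decoded lines to read.table 
through a textConnection. The temporary file below is a made-up stand-in for 
test.csv, and the names path, txt, con, lines, tc and tab are only 
illustrative:

```r
# Hypothetical stand-in for test.csv: the same table as UTF-16LE with a BOM.
path <- tempfile(fileext = ".csv")
txt <- paste0("a\tb\tc\td\r\n",
              "X\t1,2\t1,3\t1,4\r\n",
              "Y\t2,2\t2,3\t2,4\r\n",
              "Z\t3,2\t3,3\t3,4\r\n")
writeBin(c(as.raw(c(0xff, 0xfe)), as.raw(rbind(utf8ToInt(txt), 0L))), path)

# Decode the file while reading it, as in the readLines call above.
con <- file(path, open = "r", encoding = "UTF-16")
lines <- readLines(con)
close(con)

# Parse the already decoded lines as a tab-separated table
# with a decimal comma.
tc <- textConnection(lines)
tab <- read.table(tc, header = TRUE, sep = "\t", dec = ",")
close(tc)
print(tab)
```

Here the re-encoding happens in the file() connection, so read.table only 
ever sees text that is already decoded.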

Looking at the code of readtable.R in R-2.8.1 and R-2.9.0, it seems that 
the encoding is not passed through to the second call to scan() 
in that code.
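
For comparison, current R documentation describes two different arguments of 
read.table: "encoding" only marks the strings that were read as Latin-1 or 
UTF-8 and does not re-encode the input, while "fileEncoding" declares the 
encoding of the file so that it is converted on reading. The sketch below 
assumes an R version whose read.table has the "fileEncoding" argument; the 
temporary file and the names path, txt and tab are made up for illustration:

```r
# Hypothetical stand-in for test.csv: the same table as UTF-16LE with a BOM.
path <- tempfile(fileext = ".csv")
txt <- paste0("a\tb\tc\td\r\n",
              "X\t1,2\t1,3\t1,4\r\n",
              "Y\t2,2\t2,3\t2,4\r\n",
              "Z\t3,2\t3,3\t3,4\r\n")
writeBin(c(as.raw(c(0xff, 0xfe)), as.raw(rbind(utf8ToInt(txt), 0L))), path)

# Assumption: read.table supports fileEncoding in the R version in use.
# The file is re-encoded from UTF-16 while it is being read.
tab <- read.table(path, header = TRUE, sep = "\t", dec = ",",
                  fileEncoding = "UTF-16")
print(tab)
```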

I'm not sure if this is a bug or if I'm doing something wrong here.

Regards,
Hilmar

------------------
My system and R settings are:

 > sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

loaded via a namespace (and not attached):
[1] tools_2.8.1

 > Sys.info()
                     sysname                      release                      version                     nodename
                   "Windows"                      "Vista" "build 6001, Service Pack 1"                         "PC"
                     machine                        login                         user
                       "x86"

 > options("encoding")
$encoding
[1] "native.enc"



