[R] "read.table" and "scan" skips newlines which "count.fields" finds in Thai textfile

Andzsin andzsinszan at gmail.com
Wed Feb 3 05:41:35 CET 2010


Hi there,

I have a problem reading in a Thai text file:
some of the newlines are skipped.

(see the contents of my file below)

R> count.fields("my.txt", sep='\n', quote="")
[1] 1 1 1

Three lines with one item each, right?

R> scan("my.txt", what="", sep="\t", quote="")
Read 2 items
[1] "犹\x83犧癌ケ\x88 犧\x84犧」犧ア犧\x9a\n犹\x83犧癌ケ\x88 犧\x84犹謂クー" 
[2] "犹\x83犧癌ケ\x88 犧\x84犧」犧ア犧\x9a\n"  

Two items. Note that arguments to "count.fields" and "scan" are the same.
There is a newline within the first item ("\n").

R> read.table("my.txt", encoding="UTF-8", header=F, sep="\t", quote="")
[1] V1
<0 rows> (or 0-length row.names)

Zero items.
I just reduced my file to zero.

Needless to say, my editors (Vim, Em, Hidemaru) all show 3 lines.
The hex dump shows the newline characters clearly (see below).
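
For anyone who wants to repeat the byte-level check inside R rather than in a hex editor, here is a minimal sketch. It rebuilds the file from the hex dump in the Details section into a temporary file (the names hex, bytes, path, raw_in, n_lf and n_crlf are only illustrative, and a reasonably recent R version is assumed), then counts the LF bytes and the CR+LF pairs without any text decoding.

```r
# Bytes copied from the hex dump in the Details section (69 bytes in total).
hex <- paste(
  "e0b983e0b88ae0b98820e0b884e0b8a3",
  "e0b8b1e0b89a0d0ae0b983e0b88ae0b9",
  "8820e0b884e0b988e0b8b00d0ae0b983",
  "e0b88ae0b98820e0b884e0b8a3e0b8b1",
  "e0b89a0d0a",
  sep = ""
)

# Split the hex string into two-character pairs and convert them to raw bytes.
starts <- seq(1, nchar(hex), by = 2)
bytes <- as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))

# Write a stand-in for my.txt (temporary path, not the real file).
path <- tempfile(fileext = ".txt")
writeBin(bytes, path)

# Read the file back as raw bytes, so no locale or encoding is involved.
raw_in <- readBin(path, what = "raw", n = file.info(path)$size)

# Count LF bytes, and CR bytes immediately followed by LF.
n_lf <- sum(raw_in == as.raw(0x0a))
n_crlf <- sum(raw_in[-length(raw_in)] == as.raw(0x0d) &
              raw_in[-1] == as.raw(0x0a))

print(c(size = length(raw_in), lf = n_lf, crlf = n_crlf))
```

Because UTF-8 continuation bytes are always 0x80 or above, a 0x0a byte in this file can only be a real line feed, so this count does not depend on the session locale.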

I have seen related questions, but no solution, e.g.:
http://n4.nabble.com/problem-with-scan-recognizing-newline-n-td896114.html#a896115


And just for fun:
R> read.table("my.txt", encoding="justkidding")
[1] V1
<0 rows> (or 0-length row.names)

It's funny NOT to see any complaint about the "justkidding" encoding...
(that is so un-R-ish :-)

We are using R 2.8 and will be stuck with it for a while.
(I briefly installed R 2.10, but it did not seem to overcome the problem.)

Any kind of help is greatly appreciated.

Best,

andzsin


PS: replacing the Thai text with Japanese text (also UTF-8) gave slightly
different results (only some of the newlines were ignored).

********   Details:  *************

name : my.txt
lg    : Thai
enc : UTF-8
EOL : CR+LF (0d0a)
content : 
ใช่ ครับ
ใช่ ค่ะ
ใช่ ครับ
[EOF]

HEX : <copy-paste into a program that uses a fixed-width font>

00000000  e0b9 83e0 b88a e0b9 8820 e0b8 84e0 b8a3 `9.`8.`9. `8.`8#
00000010  e0b8 b1e0 b89a 0d0a e0b9 83e0 b88a e0b9 `81`8...`9.`8.`9
                         ^^^^	
00000020  8820 e0b8 84e0 b988 e0b8 b00d 0ae0 b983 . `8.`9.`80..`9.
                                     ^^^^^
00000030  e0b8 8ae0 b988 20e0 b884 e0b8 a3e0 b8b1 `8.`9. `8.`8#`81
00000040  e0b8 9a0d 0a                            `8...
                 ^^^^^


R>sessionInfo()
R version 2.8.0 (2008-10-20) 
i386-pc-mingw32 

locale:
LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932
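
Given that the file is UTF-8 while the session locale above is Japanese_Japan.932, one cross-check that might be worth trying is to read the lines with readLines(), declaring the encoding, and to split on tabs afterwards instead of letting scan() or read.table() parse the file. This is only a sketch and not a confirmed fix: whether it avoids the skipped newlines in the CP932 locale is an assumption. The temporary file is a stand-in for my.txt, and the names thai, path, lines, n_lines and fields are illustrative.

```r
# Stand-in for my.txt: the three Thai lines from the Details section,
# written as Unicode escapes so this script itself stays plain ASCII.
thai <- c("\u0e43\u0e0a\u0e48 \u0e04\u0e23\u0e31\u0e1a",
          "\u0e43\u0e0a\u0e48 \u0e04\u0e48\u0e30",
          "\u0e43\u0e0a\u0e48 \u0e04\u0e23\u0e31\u0e1a")

# Write the lines as UTF-8 bytes with CR+LF line endings (temporary path).
path <- tempfile(fileext = ".txt")
content <- paste(paste(thai, collapse = "\r\n"), "\r\n", sep = "")
writeBin(charToRaw(content), path)

# encoding= here only marks the strings as UTF-8; it does not re-encode them.
lines <- readLines(path, encoding = "UTF-8")
n_lines <- length(lines)

# Split each line on tabs after reading; this sample has no tabs,
# so every line yields exactly one field.
fields <- strsplit(lines, "\t", fixed = TRUE)

print(n_lines)
print(sapply(fields, length))
```

If this returns three lines where scan() returns two items, that would point at how scan() handles the bytes in this locale rather than at the file itself.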




