[Rd] gzfile & read.table on Win32

Henrik Bengtsson hb at maths.lth.se
Tue Mar 16 22:20:44 MET 2004


Hi, I ran into the same problem some time ago, but I still haven't
had time to troubleshoot it properly. However, I did find out that it
is related to the newline at the end of the file. Here's an example
that might give some initial clues:

# Creating two example files:
cat("1 2\n3 4\n5 6\n7 8\n9 10\n11 12\n", file="tableBad.txt")
cat("1 2\n3 4\n5 6\n7 8\n9 10\n11 12", file="tableOk.txt")

# A first simple example
df1 <- read.table("tableOk.txt")
df2 <- read.table("tableBad.txt")
if (!identical(df1,df2)) cat("df1 != df2\n")

# Then...
df3 <- read.table(gzfile("tableOk.txt"))
if (!identical(df1,df3)) cat("df1 != df3\n")
# Gives: df1 != df3

# and...
df4 <- read.table(gzfile("tableBad.txt"))
# Warning message:
# number of items read is not a multiple of the number of columns
if (!identical(df1,df4)) cat("df1 != df4\n")
# Gives: df1 != df4
if (!identical(df3,df4)) cat("df3 != df4\n")
# Gives: df3 != df4

# Details:
str(df1)
# `data.frame':   6 obs. of  2 variables:
# $ V1: int  1 3 5 7 9 11
# $ V2: int  2 4 6 8 10 12

str(df3)
# `data.frame':   6 obs. of  2 variables:
# $ V1: int  1 3 5 7 9 11
# $ V2: Factor w/ 6 levels "10","12 ","2",..: 3 4 5 6 1 2
as.character(df3$V2)
# [1] "2"   "4"   "6"   "8"   "10"  "12 "   # Note the " "

str(df4)
# `data.frame':   7 obs. of  2 variables:
# $ V1: Factor w/ 7 levels "1","11","3","5",..: 1 3 4 5 6 2 7
# $ V2: int  2 4 6 8 10 12 NA
as.character(df4$V1)
# [1] "1"  "3"  "5"  "7"  "9"  "11" " "     # Note the " "
as.character(df4$V2)
# [1] "2"  "4"  "6"  "8"  "10" "12" NA

# Note that the " " is not a space, but 
#  i) Sun Solaris 8: ASCII 24/0x18/030 
identical(as.character(df4$V1[7]), "\030")
# ii) WinXP: ASCII 255/0xFF/0377
identical(as.character(df4$V1[7]), "\377")
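
Since the difference between the two test files is only the trailing
newline, it can help to check for it directly. Below is a minimal
sketch; the helper name ends_with_newline() is hypothetical (not part
of R), and it reads the whole file into memory, so it is only meant
for small files like the ones above:

```r
# Hypothetical helper (not part of base R): TRUE if the last byte of
# the file at 'path' is a newline (0x0A), FALSE otherwise.
ends_with_newline <- function(path) {
  size <- file.info(path)$size
  if (is.na(size) || size == 0) return(FALSE)
  # Read the raw bytes so no text-mode translation gets in the way
  bytes <- readBin(path, what = "raw", n = size)
  identical(bytes[length(bytes)], as.raw(0x0A))
}

# Two temporary files mirroring tableBad.txt / tableOk.txt
withNl    <- tempfile(fileext = ".txt")
withoutNl <- tempfile(fileext = ".txt")
cat("1 2\n3 4\n", file = withNl)
cat("1 2\n3 4",   file = withoutNl)

ends_with_newline(withNl)     # TRUE
ends_with_newline(withoutNl)  # FALSE
```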

This was done on: 
R v1.8.1 & R v1.9.0alpha on WinXP, and 
R v1.8.1 on Sun Solaris 8
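
Jeff reports below that readLines() on the gzfile connection works as
expected, so one possible workaround is to read the lines first and
then parse them from a text connection. This is only a sketch under
that assumption, not a fix for the underlying problem, and I have not
verified it on the affected versions; the function name
read_table_via_lines() is made up for illustration:

```r
# Hypothetical workaround (assumption: readLines() on a gzfile
# connection is reliable, as reported in the original message).
read_table_via_lines <- function(path, ...) {
  con <- gzfile(path, open = "rt")
  # Read all lines, always closing the gzfile connection afterwards
  lines <- tryCatch(readLines(con, warn = FALSE),
                    finally = close(con))
  # Parse the in-memory lines instead of the compressed stream
  tc <- textConnection(lines)
  on.exit(close(tc))
  read.table(tc, ...)
}

# Usage: write a small gzipped table, then read it back
gzPath <- tempfile(fileext = ".gz")
out <- gzfile(gzPath, open = "wt")
cat("1 2\n3 4\n5 6\n", file = out)
close(out)
str(read_table_via_lines(gzPath))
```

For a csv file the same function can be called with header=TRUE and
sep=",", which are passed on to read.table().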

Now back to my other work ;)

Cheers

Henrik Bengtsson

> -----Original Message-----
> From: r-devel-bounces at stat.math.ethz.ch 
> [mailto:r-devel-bounces at stat.math.ethz.ch] On Behalf Of Jeff Gentry
> Sent: den 15 mars 2004 22:17
> To: r-devel at stat.math.ethz.ch
> Subject: [Rd] gzfile & read.table on Win32
> 
> 
> Hello ...
> 
> Are there any known problems or even gotchas to look out for 
> when using a gzfile connection in read.csv/read.table in Windows?
> 
> In the package PROcess, available at 
> www.bioconductor.org/repository/devel/package/html/PROcess.html
> there are two files in the PROcess/inst/Test directory which 
> are of the extension *.csv.gz.
> 
> With both files, if I open up a gzfile connection, say:
> vv <- gzfile("122402imac40-s-c-192combined i11.csv.gz")
> I can then do:
> readLines(vv, n=10)
> 
> And it works as expected.  However, if I do this:
> 
> read.csv(vv)
> 
> I get a warning:
> Warning: incomplete final line found by readTableHeader on 
> `c:/repository/checks/PROcess.Rcheck/PROcess/Test/122402imac40-s-c-192combined i11.csv.gz'
> 
> and the results of the read.table are completely broken: 
> basically it returns a 0-row matrix with one column (named 
> after the first column in the csv file).  Furthermore, 
> the connection variable itself seems to get mangled in the 
> process; if I type the variable name (e.g. 'vv' from above), I get:
> > vv
> Error in summary.connection(x) : invalid connection
> 
> Note that if I manually gunzip the file and then do a 
> 'read.csv' in R, everything works properly - so it doesn't 
> appear to be the actual file itself, but somehow related to 
> reading it in as a compressed file.
> 
> This is showing up both on R-1.8.1 and R-devel (admittedly a 
> bit out of date, currently using 2004-03-08 and am trying to 
> update on Windows now).
> 
> Thanks
> -J
> 
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list 
> https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
> 
>


