[Rd] read.table() with quoted integers

Fri Oct 4 14:34:20 CEST 2013

On Thu, Oct 3, 2013 at 9:44 AM, Jens Oehlschlägel
<Jens.Oehlschlaegel at truecluster.com> wrote:
> I agree that quoted integer columns are not the most efficient way of
> delivering csv-files. However, the sad reality is that one receives such
> formats and still needs to read the data. Therefore it is not helpful to
> state that one should 'consider "character" to be the correct colClass in
> case an integer is surrounded by quotes'.
>
> The philosophy of read.table.ffdf is to delegate the actual csv-parsing to a
> parse engine parametrized 'similarly' to 'read.table'. It is not 'bad
> coding practice' - but a conscious design decision - to assume that the
> parse engine behaves consistently, which read.table does not yet: it
> automatically recognizes a quoted integer column as 'integer', but when
> asked to explicitly interpret the column as 'integer' it does refuse to do

read.table() does not "automatically recognize a quoted integer column
as 'integer'".  If colClasses is not specified, it reads the entire
column into a 'character' vector and then calls type.convert() on it.
type.convert() does all the necessary work to determine what class the
'character' vector should be converted to.  If colClasses is
specified, quotes are not interpreted in non-'character' columns.
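
The two code paths described above can be seen on a small file. This is
only a minimal sketch: the temporary file and variable names are made up
for illustration, and the result of the explicit-colClasses call is
captured rather than assumed, since it may differ between R versions.

```r
# Hypothetical demo file: column y holds double-quoted integers.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", '1,"1"', '2,"2"', '3,"3"'), tmp)

# Path 1: no colClasses. Fields are read as character (quotes are
# interpreted), then type.convert() picks the class.
d1 <- read.csv(tmp)
cls_auto <- class(d1$y)

# What type.convert() does on its own with the already-unquoted strings.
cls_conv <- class(type.convert(c("1", "2", "3"), as.is = TRUE))

# Path 2: explicit colClasses. The column is scanned as integer directly;
# capture either the data frame or the error message.
res <- tryCatch(
  read.csv(tmp, colClasses = c("integer", "integer")),
  error = function(e) conditionMessage(e)
)

print(cls_auto)
print(cls_conv)
print(res)
unlink(tmp)
```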

You want scan() to allocate an 'integer' vector, and then ensure (on
each read from the column in the file) that the value read is a valid
'integer' type, while interpreting quotes (which strtol does not do,
so someone would have to write and test this new functionality).

So your complaint is more with scan() than read.table().  And more
with Strtoi() (and therefore strtol) than scan().
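
To make concrete what "interpreting quotes" while validating an integer
would involve, here is a rough R-level sketch. parse_quoted_int() is a
hypothetical helper, not something that exists in R, and it says nothing
about how the C code in scan() would actually have to be changed; it
also does not handle values outside the integer range.

```r
# Hypothetical helper (not part of R): strip one pair of surrounding
# double quotes, then accept only strings that look like integers.
parse_quoted_int <- function(x) {
  unquoted <- sub('^"(.*)"$', "\\1", x)
  ok <- grepl("^[-+]?[0-9]+$", unquoted)
  out <- rep(NA_integer_, length(x))
  out[ok] <- as.integer(unquoted[ok])
  # Empty fields and NA stay NA; anything else non-numeric is an error.
  bad <- !ok & !is.na(x) & nzchar(x)
  if (any(bad))
    stop("expected 'an integer', got '", x[which(bad)[1]], "'")
  out
}

vals <- parse_quoted_int(c('"1"', "2", '"-3"'))
print(vals)
```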

> so. So there is nothing wrong with read.table.ffdf (but something can be
> improved about read.table). It is *not* the 'best solution [...] to rewrite
> read.table.ffdf()' given that it nicely imports such data, see 4+1 ways to
> do so below.
>
> Jens Oehlschlägel
>
>
> # --- first create a csv file for demonstration ---
> require(ff)
> file <- "test.csv"
> path <- "c:/tmp"
> n <- 1e2
> d <- data.frame(x=1:n, y=paste0('"', 1:n, '"'))
> write.csv(d, file=file.path(path,file), row.names=FALSE, quote=FALSE)
>
> # --- how to do it with read.table.ffdf ---
>
> # 1 let the parse engine ignore colClasses and hope for the best
> fixedengine <- function(file, ..., colClasses=NA){
>         read.csv(file, ...)
> }
> df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
> FUN="fixedengine")
> df
>
> # 2 Suspend colClasses(=NA) for the quoted integer column only
> df <- read.csv.ffdf(file=file.path(path,file), first.rows = 10,
> colClasses=c("integer", NA))
> df
>
> # 3 do your own type conversion using transFUN
> #  after reading the problematic column as character
> # Being able to inject regexps is quite powerful, isn't it?
> # Or error handling in case of varying column format!
> custominterp <- function(d){
>         d[[2]] <- as.integer(gsub('"', '', d[[2]]))
>         d
> }
> df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
> colClasses=c("integer", "character"), FUN="read.csv", transFUN=custominterp)
> df
>
> # 4 do your own line parsing and type conversion
> # Here you can even handle non-standard formats
> #  such as varying number of columns
> customengine <- function(file, header=TRUE, col.names, colClasses=NA,
> nrows=0, skip=0, fileEncoding="", comment.char = ""){
>         l <- scan(file, what="character", nlines=nrows+header, skip=skip,
> fileEncoding=fileEncoding, comment.char = comment.char)
>         s <- do.call("rbind", strsplit(l, ","))
>         if (header){
>                 d <- data.frame(as.integer(s[-1,1]),
> as.integer(gsub('"','',s[-1,2])))
>                 names(d) <- s[1,]
>         }else{
>                 d <- data.frame(as.integer(s[,1]),
> as.integer(gsub('"','',s[,2])))
>         }
>         if (!missing(col.names))
>                 names(d) <- col.names
>         d
> }
> df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
> FUN="customengine")
> df
>
> # 5 use a parsing engine that can apply colClasses to quoted integers
> # Unfortunately Henry Bengtson's readDataFrame does not work as a
> #  parse engine for read.table.ffdf because read.table.ffdf expects
> #  the parse engine to read successive chunks from a file connection
> #  while readDataFrame only accepts a filename as input file spec.
> # Yes it has 'skip', but using that would reread the file from scratch
> #  for each chunk (O(N^2) costs)
>
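
The point about successive chunks can be illustrated with base R alone.
The following is a minimal sketch, not read.table.ffdf's actual
implementation; the temporary file and the chunk size of 10 are
assumptions for the demo. Because the connection stays open, each
read.csv() call continues where the previous one stopped instead of
re-reading skipped lines.

```r
# Hypothetical demo file with 25 data rows.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:25, y = 26:50), tmp, row.names = FALSE)

con <- file(tmp, open = "r")
# The first chunk consumes the header line.
first <- read.csv(con, nrows = 10)
chunks <- list(first)
repeat {
  nxt <- tryCatch(
    read.csv(con, header = FALSE, nrows = 10, col.names = names(first)),
    error = function(e) NULL  # read.csv() errors once no lines remain
  )
  if (is.null(nxt) || nrow(nxt) == 0L) break
  chunks[[length(chunks) + 1L]] <- nxt
}
close(con)
unlink(tmp)

all_rows <- do.call(rbind, chunks)
print(dim(all_rows))
```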
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com