[Rd] read.csv

Petr Savicky savicky at cs.cas.cz
Thu Jun 25 11:23:25 CEST 2009


On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:
> If read.csv's colClasses= argument is NOT used then read.csv accepts
> double quoted numerics:
> 
> > read.csv(stdin())
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
>   A B
> 1 1 1
> 2 2 2
> 
> However, if colClasses is used then it seems that it does not:
> 
> > read.csv(stdin(), colClasses = "numeric")
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   scan() expected 'a real', got '"1"'
> 
> Is this really intended?  I would have expected that a csv file in which
> each field is surrounded with double quotes is accepted in both
> cases.  This may be the documented behaviour, yet it seems undesirable
> from both a consistency viewpoint and from the viewpoint that it should
> be possible to double-quote fields in a csv file.

The problem is not specific to read.csv(); the same difference appears
for read.table():
  read.table(stdin())
  "1" 1
  2 "2"
  
  #   V1 V2
  # 1  1  1
  # 2  2  2
but
  read.table(stdin(), colClasses = "numeric")
  "1" 1
  2 "2"
  
  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"1"'

The error occurs in the call to scan() at line 152 of
src/library/utils/R/readtable.R, which is
  data <- scan(file = file, what = what, sep = sep, quote = quote, ...
(This is the third call to scan() in the source code of read.table().)

In this call, scan() receives the column types in the "what" argument. If a
type is specified, scan() performs the conversion itself and fails if a
numeric field is quoted. If the type is not specified, the corresponding
output of scan() is of type character, with any quotes from the input file
already removed. Columns of unknown type are then converted using
type.convert(), which therefore receives the data without quotes.

The call to type.convert() is contained in a loop
    for (i in (1L:cols)[do]) {
        data[[i]] <-
            if (is.na(colClasses[i]))
                type.convert(data[[i]], as.is = as.is[i], dec = dec,
                             na.strings = character(0L))
        ## as na.strings have already been converted to <NA>
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
    }
This loop also contains branches that could perform the conversion for
columns with a specified type, but these branches are never reached, since
the vector "do" is defined as
  do <- keep & !known 
where "known" marks the columns whose type is already known.

It is possible to modify the code so that scan() is called with all types
unspecified, leaving the conversion to the lines
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
above. Since this solution is already prepared in the code, the patch is
very simple:
  --- R-devel/src/library/utils/R/readtable.R     2009-05-18 17:53:08.000000000 +0200
  +++ R-devel-readtable/src/library/utils/R/readtable.R   2009-06-25 10:20:06.000000000 +0200
  @@ -143,9 +143,6 @@
       names(what) <- col.names
   
       colClasses[colClasses %in% c("real", "double")] <- "numeric"
  -    known <- colClasses %in%
  -                c("logical", "integer", "numeric", "complex", "character")
  -    what[known] <- sapply(colClasses[known], do.call, list(0))
       what[colClasses %in% "NULL"] <- list(NULL)
       keep <- !sapply(what, is.null)
   
  @@ -189,7 +186,7 @@
          stop(gettextf("'as.is' has the wrong length %d  != cols = %d",
                        length(as.is), cols), domain = NA)
   
  -    do <- keep & !known # & !as.is
  +    do <- keep & !as.is
       if(rlabp) do[1L] <- FALSE # don't convert "row.names"
       for (i in (1L:cols)[do]) {
           data[[i]] <-
(The patch is also included as an attachment.)

I tested the patch as follows:
  d1 <- read.table(stdin())
  "1" TRUE   3.5
  2   NA     "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d1, typeof)
  #        V1        V2        V3 
  # "integer" "logical"  "double" 
  is.na(d1)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE
  
  d2 <- read.table(stdin(), colClasses=c("integer", "logical", "double"))
  "1" TRUE   3.5
  2   NA     "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d2, typeof)
  #        V1        V2        V3 
  # "integer" "logical"  "double" 
  is.na(d2)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE

I think there was a reason to let scan() perform the type conversion; for
example, it may be more efficient. So, if correct, the above patch is one
possible solution, but another may be more appropriate. In particular,
scan() could be modified to strip the quotes also from fields whose type
is specified as numeric.
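
Until one of these changes is made, a possible user-level workaround is to
read the affected columns as character (scan() strips the quotes for
character fields) and convert afterwards, for example:

  d <- read.table(textConnection('"1" 1\n2 "2"'), colClasses = "character")
  d[] <- lapply(d, as.numeric)
  sapply(d, typeof)
  #       V1       V2 
  # "double" "double" 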

Petr.


