[Rd] read.csv

Petr Savicky savicky at cs.cas.cz
Tue Jun 16 20:09:01 CEST 2009


On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:
> On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
> > If read.csv's colClasses= argument is NOT used then read.csv accepts
> > double quoted numerics:
> > 
> > > read.csv(stdin())
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> >   A B
> > 1 1 1
> > 2 2 2
> > 
> > However, if colClasses is used then it seems that it does not:
> > 
> >> read.csv(stdin(), colClasses = "numeric")
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> > na.strings,  :
> >   scan() expected 'a real', got '"1"'
> > 
> > Is this really intended?  I would have expected that a csv file
> > in which each field is surrounded with double quotes is acceptable
> > in both cases. This may be documented as is yet seems undesirable
> > from both a consistency viewpoint and the viewpoint that it should
> > be possible to double quote fields in a csv file.
> 
> Well, the default for colClasses is NA, for which ?read.csv says:
>   [...]
>   Possible values are 'NA' (when 'type.convert' is used),
>   [...]
> and then ?type.convert says:
>   This is principally a helper function for 'read.table'. Given a
>   character vector, it attempts to convert it to logical, integer,
>   numeric or complex, and failing that converts it to factor unless
>   'as.is = TRUE'.  The first type that can accept all the non-missing
>   values is chosen.
> 
> It would seem that type 'logical' won't accept integer (naively one
> might expect 1 --> TRUE, but see experiment below), so the first
> acceptable type for "1" is integer, and that is what happens.
> So it is indeed documented (in the R[ecursive] sense of "documented" :))
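The "first type that can accept all the non-missing values" rule is easy to
check at the prompt (a minimal sketch; as.is = TRUE is passed explicitly so
the fallback is character rather than factor):

```r
# type.convert() tries logical, integer, numeric, complex in order
# and falls back to character (or factor) when none of them fits.
type.convert(c("TRUE", "FALSE"), as.is = TRUE)  # logical
type.convert(c("1", "2"),        as.is = TRUE)  # integer: logical rejects "1"
type.convert(c("1", "2.5"),      as.is = TRUE)  # numeric (double)
type.convert(c("1", "x"),        as.is = TRUE)  # character: nothing numeric fits
```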
> 
> However, presumably when colClasses is used then type.convert() is
> not called; in that case R sees itself being asked to assign a
> character entity to a destination which it has been told shall be
> integer. The default for as.is is
>   as.is = !stringsAsFactors
> but for this ?read.csv says that stringsAsFactors "is overridden
> bu [sic] 'as.is' and 'colClasses', both of which allow finer
> control.", so that wouldn't come to the rescue either.
> 
> Experiment:
>   X <- logical(10)
>   class(X)
>   # [1] "logical"
>   X[1] <- 1
>   X
>   # [1] 1 0 0 0 0 0 0 0 0 0
>   class(X)
>   # [1] "numeric"
> so R has converted X from class 'logical' to class 'numeric'
> on being asked to assign a number to a logical; but in this
> case its hands were not tied by colClasses.
> 
> Or am I missing something?!!

In my opinion, you explain how it happens that there is a difference
in behavior between
  read.csv(stdin(), colClasses = "numeric")
and
  read.csv(stdin())
but not why it is so.

The algorithm "use the smallest type that accepts all non-missing values"
may equally well be applied to the input values either literally or after
removing the quotes. Is there a reason why
  read.csv(stdin())
removes quotes from the input values and
  read.csv(stdin(), colClasses = "numeric")
does not?
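The asymmetry is easy to reproduce without interactive input; a minimal
sketch, using textConnection() in place of stdin(), with reading the column
as character first as one possible workaround:

```r
csv <- 'A,B\n"1",1\n"2",2'

# Default path: fields are scanned as character (quotes stripped),
# then type.convert() picks the narrowest fitting type.
d1 <- read.csv(textConnection(csv))
d1$A  # an integer column, 1 2

# Workaround for quoted numeric fields: read as character first,
# so the quote stripping happens, then convert explicitly.
d2 <- read.csv(textConnection(csv), colClasses = "character")
d2$A <- as.numeric(d2$A)
```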

Using double-quote characters is part of the definition of a CSV file; see,
for example,
  http://en.wikipedia.org/wiki/Comma_separated_values
where one may find:
  Fields may always be enclosed within double-quote characters, whether necessary or not.

Petr.
