[Rd] read.table() with quoted integers

Fri Oct 4 16:01:46 CEST 2013

On Friday, 4 October 2013 at 07:55 -0400, Duncan Murdoch wrote:
> On 13-10-04 7:31 AM, Joshua Ulrich wrote:
> > On Tue, Oct 1, 2013 at 11:29 AM, David Winsemius <dwinsemius at comcast.net> wrote:
> >>
> >> On Sep 30, 2013, at 6:38 AM, Joshua Ulrich wrote:
> >>
> >>> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> >>>> Hi!
> >>>>
> >>>>
> >>>> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> >>>> quoted integers as an acceptable value for columns for which
> >>>> colClasses="integer". But when colClasses is omitted, these columns are
> >>>> read as integer anyway.
> >>>>
> >>>> For example, let's consider a file named file.dat, containing:
> >>>> "1"
> >>>> "2"
> >>>>
> >>>>> read.table("file.dat", colClasses="integer")
> >>>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> >>>>   scan() expected 'an integer' and got '"1"'
> >>>>
> >>>> But:
> >>>>> str(read.table("file.dat"))
> >>>> 'data.frame':   2 obs. of  1 variable:
> >>>> $ V1: int  1 2
> >>>>
> >>>> The latter result is indeed documented in ?read.table:
> >>>>      Unless ‘colClasses’ is specified, all columns are read as
> >>>>      character columns and then converted using ‘type.convert’ to
> >>>>      logical, integer, numeric, complex or (depending on ‘as.is’)
> >>>>      factor as appropriate.  Quotes are (by default) interpreted in all
> >>>>      fields, so a column of values like ‘"42"’ will result in an
> >>>>      integer column.
> >>>>
> >>>>
> >>>> Should the former behavior be considered a bug?
> >>>>
> >>> No. If you tell read.table the column is integer and it's actually
> >>> character on disk, it should be an error.
> >>
> >> My reading of the `read.table` help page is that one should expect that when
> >> there is an 'integer'-class and an  `as.integer` function and  "integer" is the
> >> argument to colClasses, that `as.integer` will be applied to the values in the
> >> column. Should I be reading elsewhere?
> >>
> > I assume you're referring to the paragraph below.
> >
> >    Possible values are ‘NA’ (the default, when ‘type.convert’ is
> >    used), ‘"NULL"’ (when the column is skipped), one of the
> >    atomic vector classes (logical, integer, numeric, complex,
> >    character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
> >    Otherwise there needs to be an ‘as’ method (from package
> >    ‘methods’) for conversion from ‘"character"’ to the specified
> >    formal class.
> >
> > I read that as meaning that an "as" method is required for classes not
> > already listed in the prior sentence.  It doesn't say an "as" method
> > will be applied if colClasses is one of the atomic, factor, Date, or
> > POSIXct classes; but I can see how you might assume that, since all
> > the atomic, factor, Date, and POSIXct classes already have "as"
> > methods...
> 
> And this does suggest a workaround for ffdf:  instead of declaring the 
> class to be "integer", declare a class "ffdf_integer", and write a 
> conversion method.  Or simply read everything as character and call 
> as.integer() explicitly.
This is indeed an interesting workaround for read.table.ffdf(), thanks!
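
For the archive, here is a minimal sketch of both suggestions, shown with
plain read.table() only; whether read.table.ffdf() treats the custom class
the same way is not tested here. The class name "ffdf_integer" is the one
proposed above and is purely illustrative, and the temporary file stands in
for file.dat from the original report.

```r
library(methods)

# Recreate the example file: one column of quoted integers.
path <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), path)

# Workaround 1: declare a placeholder class (name is illustrative) and a
# conversion method from character. read.table() reads the column as
# character, strips the quotes, then converts it with as().
setClass("ffdf_integer")
setAs("character", "ffdf_integer", function(from) as.integer(from))
df1 <- read.table(path, colClasses = "ffdf_integer")

# Workaround 2: read everything as character and convert explicitly.
df2 <- read.table(path, colClasses = "character")
df2$V1 <- as.integer(df2$V1)

str(df1)
str(df2)
```

The first variant keeps the conversion inside the read call, which matters
when the caller only exposes colClasses; the second is simpler when the
data frame can be post-processed.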

I still think adapting the behavior of scan() would be an interesting
improvement for R users, though.

Regards