[Rd] read.table() with quoted integers

Milan Bouchet-Valat nalimilan at club.fr
Mon Sep 30 17:27:39 CEST 2013


On Monday, 30 September 2013 at 17:10 +0200, Joris Meys wrote:
> Regardless of whether "stored as character" is interpreted the R way
> or the ASCII way, the point Joshua makes is rather valid. Mainly
> because read.table has an argument quote with default value \"'. This
> means that at least according to R, everything between either " or '
> should be seen as of type character and not integer. 
I don't think the problem is related to the quote argument at all:
> read.table("file.csv", colClasses="integer", quote=NULL)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
  scan() expected 'an integer' and got '"1"'

> The only way these quotes can end up in a .csv file, is when in the
> rendering program (often Excel), these integers are called "character"
> inside the program as well. So they're not treated as integers by the
> person that created the file, so R won't treat them
> as integers either. Note that read.table does read the quoted integers
> as characters, and only afterwards convert those.
Yeah, I understand how the conversion happens, but I wonder whether the
result really makes sense. The fact that you cannot set colClasses to
the classes you are actually getting when reading the file is somewhat
disturbing...
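
To make the conversion path concrete, here is a minimal sketch of the two steps described above: reading the column as character (which strips the quotes) and then converting it. The inline `txt` vector is an assumption standing in for the file; it is not part of the original report.

```r
# Hypothetical two-line input mirroring the file discussed in this thread.
txt <- c('"1"', '"2"')

# Step 1: read the column as character; the default quote setting
# removes the surrounding double quotes.
raw <- read.table(text = txt, colClasses = "character")

# Step 2: convert the unquoted strings, which is what read.table()
# does through type.convert() when colClasses is omitted.
converted <- type.convert(raw$V1, as.is = TRUE)

str(raw$V1)
str(converted)
```

The integer class only appears in the second step, which is why asking for it up front through colClasses takes a different code path.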

> So yes, this is an issue with read.table.ffdf more than with R itself.
> And the problem is indeed how integers are treated the moment they are
> stored. This refers to the presence/absence of the quote character.
Of course this could be fixed in read.table.ffdf(), but that would be
quite hacky, since it could no longer rely cleanly on read.table() as it
does now: it would need to read the file directly to check whether the
fields are quoted (the result of read.table() does not reveal whether
quotes were present). To me this tends to indicate that something is
wrong in the way read.table() works.
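
For illustration only, one way such a check might look is to read the fields with quoting disabled, so that the quote characters survive in the values, and test for them. This is a sketch under assumptions (inline `txt` input, a single column, double quotes only), not what read.table.ffdf() does.

```r
# Hypothetical input standing in for the first rows of the file.
txt <- c('"1"', '"2"')

# With quote = "" the quote characters are kept as part of each field.
raw <- read.table(text = txt, quote = "", colClasses = "character")

# A field counts as quoted if it starts and ends with a double quote.
is_quoted <- grepl('^".*"$', raw$V1)

print(raw$V1)
print(is_quoted)
```

This needs a second pass over the same rows purely to recover information that the normal read discards, which is the hackiness referred to above.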

FWIW, changing the behavior of read.table() to skip quotes when
colClasses="integer" would not break any existing program since it would
only avoid an error where it previously happened, without modifying
working cases.
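
Until something like that exists, a user-level workaround can be sketched as follows. The helper name `read_table_lenient` is hypothetical (it is not part of R or ff), and the sketch assumes colClasses is given as an unnamed character vector: it requests character wherever integer was asked for and converts afterwards.

```r
# Hypothetical helper: accept quoted values in integer columns by reading
# those columns as character first and converting them afterwards.
read_table_lenient <- function(..., colClasses) {
  wanted <- colClasses
  # Ask read.table() for character wherever integer was requested.
  relaxed <- ifelse(wanted == "integer", "character", wanted)
  df <- read.table(..., colClasses = relaxed)
  # Recycle a single class over all columns, as read.table() does.
  wanted <- rep_len(wanted, ncol(df))
  for (j in which(wanted == "integer")) {
    df[[j]] <- as.integer(df[[j]])
  }
  df
}

# Usage with inline text instead of a file.
df <- read_table_lenient(text = c('"1" "a"', '"2" "b"'),
                         colClasses = c("integer", "character"))
str(df)
```

Note that as.integer() turns unparsable values into NA with a warning instead of raising an error, so this is more permissive than the change proposed above.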


Regards

> 
> Regards
> Joris
> 
> 
> On Mon, Sep 30, 2013 at 4:45 PM, Milan Bouchet-Valat
> <nalimilan at club.fr> wrote:
>         On Monday, 30 September 2013 at 08:38 -0500, Joshua Ulrich
>         wrote:
>         > On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
>         > > Hi!
>         > >
>         > >
>         > > It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
>         > > quoted integers as an acceptable value for columns for which
>         > > colClasses="integer". But when colClasses is omitted, these columns are
>         > > read as integer anyway.
>         > >
>         > > For example, let's consider a file named file.dat, containing:
>         > > "1"
>         > > "2"
>         > >
>         > >> read.table("file.dat", colClasses="integer")
>         > > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>         > >   scan() expected 'an integer' and got '"1"'
>         > >
>         > > But:
>         > >> str(read.table("file.dat"))
>         > > 'data.frame':   2 obs. of  1 variable:
>         > >  $ V1: int  1 2
>         > >
>         > > The latter result is indeed documented in ?read.table:
>         > >      Unless ‘colClasses’ is specified, all columns are read as
>         > >      character columns and then converted using ‘type.convert’ to
>         > >      logical, integer, numeric, complex or (depending on ‘as.is’)
>         > >      factor as appropriate.  Quotes are (by default) interpreted in all
>         > >      fields, so a column of values like ‘"42"’ will result in an
>         > >      integer column.
>         > >
>         > >
>         > > Should the former behavior be considered a bug?
>         > >
>         > No. If you tell read.table the column is integer and it's actually
>         > character on disk, it should be an error.
>         
>         All values in a CSV file are stored as characters on disk,
>         regardless of whether they are surrounded by quotes. 1 is saved
>         as 00110001 (ASCII character #49), not 00000001, nor 00000000
>         00000000 00000000 00000001 (as 32-bit storage of integers would
>         imply, for example).
>         
>         So, with all due respect, please refrain from formulating such
>         blatantly erroneous statements.
>         
>         
>         Regards
>         
>         
>         > > This creates problems when combined with read.table.ffdf from package
>         > > ff, since this function tries to guess the column classes by reading the
>         > > first rows of the file, and then passes colClasses to read.table to read
>         > > the remaining rows by chunks. A column of quoted integers is correctly
>         > > detected as integer in the first read, but read.table() fails in
>         > > subsequent reads.
>         > >
>         > This sounds like an issue with read.table.ffdf.  The column of quoted
>         > integers is *incorrectly* detected as integer because they're actually
>         > character on disk.  read.table.ffdf should rely on how the data are
>         > actually stored on disk (via as.is=TRUE), not how read.table might
>         > convert them once they're read into R.
>         >
>         > >
>         > > Regards
>         > >
>         > > ______________________________________________
>         > > R-devel at r-project.org mailing list
>         > > https://stat.ethz.ch/mailman/listinfo/r-devel
>         >
>         > --
>         > Joshua Ulrich  |  about.me/joshuaulrich
>         > FOSS Trading  |  www.fosstrading.com
>         
>         
> 
> 
> 
> 
> -- 
> Joris Meys
> Statistical consultant
> 
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
> 
> tel : +32 9 264 59 87
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


