[Rd] read.table() with quoted integers

Fri Oct 4 18:28:57 CEST 2013

On Fri, Oct 4, 2013 at 9:15 AM, peter dalgaard <pdalgd at gmail.com> wrote:
>
> On Oct 4, 2013, at 17:10 , Henrik Bengtsson wrote:
>
>> On Fri, Oct 4, 2013 at 4:55 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>> On 13-10-04 7:31 AM, Joshua Ulrich wrote:
>>>>
>>>> On Tue, Oct 1, 2013 at 11:29 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>>
>>>>>
>>>>> On Sep 30, 2013, at 6:38 AM, Joshua Ulrich wrote:
>>>>>
>>>>>> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>>
>>>>>>> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
>>>>>>> quoted integers as an acceptable value for columns for which
>>>>>>> colClasses="integer". But when colClasses is omitted, these columns are
>>>>>>> read as integer anyway.
>>>>>>>
>>>>>>> For example, let's consider a file named file.dat, containing:
>>>>>>> "1"
>>>>>>> "2"
>>>>>>>
>>>>>>>> read.table("file.dat", colClasses="integer")
>>>>>>>
>>>>>>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>>>>>> na.strings, :
>>>>>>>  scan() expected 'an integer' and got '"1"'
>>>>>>>
>>>>>>> But:
>>>>>>>>
>>>>>>>> str(read.table("file.dat"))
>>>>>>>
>>>>>>> 'data.frame':   2 obs. of  1 variable:
>>>>>>> $ V1: int  1 2
>>>>>>>
>>>>>>> The latter result is indeed documented in ?read.table:
>>>>>>>     Unless ‘colClasses’ is specified, all columns are read as
>>>>>>>     character columns and then converted using ‘type.convert’ to
>>>>>>>     logical, integer, numeric, complex or (depending on ‘as.is’)
>>>>>>>     factor as appropriate.  Quotes are (by default) interpreted in all
>>>>>>>     fields, so a column of values like ‘"42"’ will result in an
>>>>>>>     integer column.
>>>>>>>
>>>>>>>
>>>>>>> Should the former behavior be considered a bug?
>>>>>>>
>>>>>> No. If you tell read.table the column is integer and it's actually
>>>>>> character on disk, it should be an error.
>>>>>
>>>>>
>>>>> My reading of the `read.table` help page is that one should expect that
>>>>> when there is an 'integer'-class and an `as.integer` function and
>>>>> "integer" is the argument to colClasses, that `as.integer` will be
>>>>> applied to the values in the column. Should I be reading elsewhere?
>>>>>
>>>> I assume you're referring to the paragraph below.
>>>>
>>>>   Possible values are ‘NA’ (the default, when ‘type.convert’ is
>>>>   used), ‘"NULL"’ (when the column is skipped), one of the
>>>>   atomic vector classes (logical, integer, numeric, complex,
>>>>   character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
>>>>   Otherwise there needs to be an ‘as’ method (from package
>>>>   ‘methods’) for conversion from ‘"character"’ to the specified
>>>>   formal class.
>>>>
>>>> I read that as meaning that an "as" method is required for classes not
>>>> already listed in the prior sentence.  It doesn't say an "as" method
>>>> will be applied if colClasses is one of the atomic, factor, Date, or
>>>> POSIXct classes; but I can see how you might assume that, since all
>>>> the atomic, factor, Date, and POSIXct classes already have "as"
>>>> methods...
>>>
>>>
>>> And this does suggest a workaround for ffdf:  instead of declaring the class
>>> to be "integer", declare a class "ffdf_integer", and write a conversion
>>> method.  Or simply read everything as character and call as.integer()
>>> explicitly.
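
For illustration, a minimal sketch of the first workaround, under some assumptions: the class name "ffdf_integer" is taken from the suggestion above, the regular-expression check is my own addition (as.integer() alone would silently truncate "1.5" and turn "foo" into NA), and the temporary file merely stands in for the OP's file.dat.

```r
library(methods)

# Virtual class used only as a colClasses label; name taken from the
# suggestion above, everything else here is an illustrative assumption.
setClass("ffdf_integer")

# Conversion from character, with validation added so bad input still fails.
setAs("character", "ffdf_integer", function(from) {
  ok <- is.na(from) | grepl("^[-+]?[0-9]+$", from)
  if (!all(ok)) {
    stop("non-integer value(s): ", paste(from[!ok], collapse = ", "))
  }
  as.integer(from)
})

# Stand-in for file.dat from the original post.
tmp <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), tmp)

# read.table() reads the column as character (stripping the quotes) and
# then applies the "as" method registered above.
x <- read.table(tmp, colClasses = "ffdf_integer")
str(x)
```

Because the column is read as character first, the quotes are interpreted before the conversion runs, which is what the built-in "integer" class does not do.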
>>
>> Just a note of concert since several proposed it:
>
> concerN?

Ah, yet again, that beautiful music I always hear in my head when I
read R-devel.

>
>> colClasses="character") followed by as.integer() or strtoi() misses
>> the validation, e.g. "foo" will be turned into NA_integer_.  Using
>> read.table() or scan() gives an error.
>
> The obvious fix for that would seem to be to use scan() on the character vector:
>
>> y <- c("1","2",3,4,5)
>> y
> [1] "1" "2" "3" "4" "5"
>> scan(text=y)
> Read 5 items
> [1] 1 2 3 4 5
>> y <- c("1","2",3,4,"NA")
>> scan(text=y)
> Read 5 items
> [1]  1  2  3  4 NA
>> y <- c("1","2",3,4,"foo")
>> scan(text=y)
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   scan() expected 'a real', got 'foo'

Yep, that's also what I proposed above, though it could have been more
explicit.  See also an earlier reply of mine where I refer to code of
readDataFrame for TabularTextFile
[https://r-forge.r-project.org/scm/viewvc.php/pkg/R.filesets/R/TabularTextFile.R?view=markup&root=r-dots]
doing this (as an illustration for OP).
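
To spell that out, here is a rough sketch of the two-step approach (not the readDataFrame code itself). It assumes a one-column file like the OP's file.dat; the temporary file and the variable names are made up for the example.

```r
# Stand-in for file.dat from the original post.
tmp <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), tmp)

# Step 1: read as character, so that read.table() strips the quotes.
raw <- read.table(tmp, colClasses = "character")

# Step 2: re-parse the character vector with scan(), which raises an
# error on anything that is not an integer instead of returning NA.
v1 <- scan(text = raw$V1, what = integer(), quiet = TRUE)

# scan() skips blank lines, so guard against empty fields dropping out.
stopifnot(length(v1) == nrow(raw))

str(v1)
```

Unlike as.integer() or strtoi(), the scan() step keeps the validation: a value such as "foo" stops with an error rather than becoming NA_integer_.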

/H

>
>
>>
>> /Henrik
>>
>>>
>>> Duncan Murdoch
>>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>