[Rd] read.table() with quoted integers

ashmoran ash.moran at patchspace.co.uk
Sat Apr 26 18:01:37 CEST 2014


Milan Bouchet-Valat wrote
> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> quoted integers as an acceptable value for columns for which
> colClasses="integer". But when colClasses is omitted, these columns are
> read as integer anyway.
> 
> For example, let's consider a file named file.dat, containing:
> "1"
> "2"
> 
>> read.table("file.dat", colClasses="integer")
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
> : 
>   scan() expected 'an integer' and got '"1"'

Hi 

I just ran into a variation of this. I'm teaching myself agent-based
modelling from a book that uses NetLogo as the implementation language[1].
NetLogo has a feature called BehaviorSpace that runs models over a varying
range of parameter values and makes arbitrary observations at each time step,
which it then outputs to a CSV file. One of the exercises involves plotting
some graphs of a model run, but the output needs some processing before it
can be graphed. Rather than hack away at the data by hand each time I run it,
I decided to find a stats package to help, and I chose R. I'm a complete
beginner to R, and I've been using the R in Action early access PDF as a
guide[2]. I'm using R 3.1.0 GUI 1.64 Mavericks build (6734).

The NetLogo CSV writer quotes all values, and mixes integers and floats. So
a column of data might contain, say (with the quotes actually in the file),
"0", "1.25", "1", "2", "3.175". I tried importing the data like this:

    profit <- read.csv("BusinessInvestor1 Profit-table.csv", sep=",",
                       header=TRUE, skip=6)

But then some of the data is read in as factors:

    str(profit)
    'data.frame':	1560 obs. of  9 variables:
     $ X.run.number.                                                 : int  8 6 2 7 5 1 3 4 6 8 ...
     $ restrict.sensing.radius                                       : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
     $ risk.multiplier                                               : int  1 1 1 1 1 1 1 1 1 1 ...
     $ sensing.radius                                                : int  1 1 1 1 1 1 1 1 1 1 ...
     $ profit.multiplier                                             : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
     $ X.step.                                                       : int  0 0 0 0 0 0 0 0 1 1 ...
     $ mean..wealth..of.turtles                                      : Factor w/ 1501 levels "0","100038.136",..: 1 1 1 1 1 1 1 1 623 550 ...
     $ mean..profit..of.patches.with..any..turtles.here.             : Factor w/ 1547 levels "2503.675","2582.275",..: 1 8 7 6 5 4 3 10 278 230 ...
     $ mean..failure.probability..of.patches.with..any..turtles.here.: Factor w/ 1558 levels "0.026069451281579437",..: 1504 1528 1508 1518 1516 1514 1512 1536 1321 1471 ...

(For reasons I don't understand, the profit.multiplier parameter – which
runs "0.5", "0.6", …, "1" – is imported as a numeric, whereas the
observation values get turned into factors.)

I read about colClasses but this trips over the "quoted integers aren't
integers" bug:

    profit <- read.csv("BusinessInvestor1 Profit-table.csv", sep=",",
                       header=TRUE, skip=6,
                       colClasses=c("integer", "logical", "numeric",
                                    "numeric", "numeric", "integer",
                                    "numeric", "numeric", "numeric"))
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
      scan() expected 'an integer', got '"8"'
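
One possible workaround, sketched here under assumptions rather than taken
from the script described in this post: read every column as character
(the surrounding quotes are stripped for character fields) and convert each
column explicitly afterwards. The column names and values below are made up
for illustration; they are not the real NetLogo output.

```r
# Hypothetical stand-in for the NetLogo file: every value is quoted.
csv_text <- paste(
  '"run","enabled","value"',
  '"8","false","0"',
  '"6","false","1.25"',
  '"2","true","3.175"',
  sep = "\n"
)

# Read all columns as character so scan() never has to parse a quoted number.
raw <- read.csv(text = csv_text, colClasses = "character")

# Convert each column to the type that was actually intended.
clean <- data.frame(
  run     = as.integer(raw$run),
  enabled = as.logical(raw$enabled),
  value   = as.numeric(raw$value)
)

str(clean)
```

With a real file, the `text =` argument would be replaced by the file name,
and `skip` and `header` would be passed as in the calls above.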

I wrote a little script to import the data and use some casts to clean it
up. At first I thought it was working, until I realised that this line (to
process CSV data in the range 0...1):

    profit$mean_failure_probability_of_inhabited <-
      as.numeric(profit$mean_failure_probability_of_inhabited)

was producing crazy values:

    str(profit)
    'data.frame':   1560 obs. of  9 variables:
    ...
     $ mean_failure_probability_of_inhabited: num  1504 1528 1508 1518 1516 ...

Eventually I worked out that I needed to do this (although I haven't yet
figured out why):

    profit$mean_failure_probability_of_inhabited <-
      as.numeric(profit$mean_failure_probability_of_inhabited |> as.character())
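
For what it's worth, the behaviour above is consistent with how R stores
factors: a factor holds integer codes that index into a sorted set of level
labels, and as.numeric() on a factor returns those codes, not the labels.
A minimal sketch with made-up values (not the real column):

```r
# A factor stores integer codes pointing into its sorted levels.
f <- factor(c("0.9", "0.026", "0.5"))

levels(f)                            # "0.026" "0.5" "0.9"

# Converting the factor directly yields the codes (positions in levels(f)).
codes <- as.numeric(f)               # 3 1 2

# Going through character first recovers the numbers the labels spell out.
values <- as.numeric(as.character(f))  # 0.9 0.026 0.5
```

This would explain values such as 1504 and 1528 appearing in a column whose
data lies between 0 and 1: they are positions among the 1558 levels.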

Anyway, for a beginner coming to R, this is all REALLY confusing, and it's
taken me several hours to get my head round it. Although after reading about
it a bit I can see the implementation issues causing this behaviour, as a
noob it just feels like "R can't import CSV data". The most baffling thing
is how telling R the type of each column actually *reduces* its ability to
read the file! (For a while I thought it was complaining because "8" is an
integer, not a real, but now I see it's because it's seeing it as a string.)
My understanding of CSV was the same as the one Peter Meilstrup describes
later in the thread: quotes in a CSV are there to allow the delimiter
character in a value, and don't imply anything about the type of the data
(because CSVs are untyped).
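
As a small illustration of that reading of CSV, here is a sketch with
invented data: the quotes let a value contain a comma, and when colClasses
is omitted read.csv() still infers an integer column from quoted digits.

```r
# Invented example: the first column needs quotes because it contains commas.
csv_text <- paste(
  'label,count',
  '"Smith, J.","8"',
  '"Jones, A.","6"',
  sep = "\n"
)

# No colClasses: fields are unquoted first, then each column's type is inferred.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(df)
```

Here `label` comes back as character and `count` as integer, even though
both were quoted in the input.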

Googling the scan() error led to this mailing list thread so I thought I'd
describe my experience. If there's a more intuitive way for read.csv /
read.table to work it might save beginners like me a lot of head-scratching!

Best regards
Ash

[1] http://www.amazon.com/dp/0691136742/
[2] http://www.manning.com/kabacoff2/





