[R] read.csv timing

Mon Jun 10 19:53:59 CEST 2013

here are some small benchmarks on an i7-2600k with an SSD:

input file: 104,126 rows with 76 columns.  all numeric.

linux> time bzcat bzfile.csv.bz2 > /dev/null  --> 1.8 seconds

R> d <- read.csv( pipe( bzfile ) )   --> 6.3 seconds
R> d <- read.csv( pipe( bzfile ), colClasses="numeric")  --> 4.2 seconds

converting the file into an R data structure more than doubles the
load time relative to raw decompression (4.2 vs. 1.8 seconds).  if the
colClasses are not specified, it takes another 50% longer (6.3 seconds).
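
a minimal sketch of how such a comparison could be reproduced with
system.time().  the synthetic file, its (much smaller) dimensions, and
the names tmp, d_guess, and d_typed are assumptions for illustration;
this is not the data or the exact procedure behind the timings above.

```r
# sketch: time read.csv with and without colClasses on a synthetic numeric file
set.seed(1)
n_rows <- 20000   # assumed size, far smaller than the 104,126 x 76 file above
n_cols <- 20
m <- matrix(rnorm(n_rows * n_cols), nrow = n_rows)
tmp <- tempfile(fileext = ".csv")   # hypothetical scratch file
write.csv(as.data.frame(m), tmp, row.names = FALSE)

# let read.csv guess the column types
t_guess <- system.time(d_guess <- read.csv(tmp))

# tell read.csv up front that every column is numeric
t_typed <- system.time(d_typed <- read.csv(tmp, colClasses = "numeric"))

# elapsed wall-clock seconds for both variants
print(c(guess = t_guess[["elapsed"]], typed = t_typed[["elapsed"]]))
```

the absolute numbers depend on the machine and the file, so only the
ratio between the two variants is worth comparing.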

some more experiments: save in R format (gzip format) --- this
increases file size from 15MB to 20MB.  how fast is the filesystem?

linux> time gzcat file.Rdata > /dev/null  --> 0.4 seconds

the linux file system and CPU can decompress the 15MB .bz2 file in 1.8
seconds and decompress the 20MB .gz file in 0.4 seconds.  this is
surprising.  let's make sure that this is due to the .gz format.
indeed:

linux> bunzip2 bzfile.csv.bz2 ; gzip bzfile.csv
linux> time gzcat bzfile.csv.gz > /dev/null  --> 0.4 seconds

reading .gz files is much faster on my linux system than reading .bz2
files.  this surprises me.  I would have thought my CPU is so fast that
the time spent decompressing even bzip2 is almost zero, so that the
number of bytes read from disk would be the primary determinant of
speed, and bzip2 should have been faster.  well, ok, maybe slower, but
not by a factor of 4.

now I am thinking that maybe I should use .gz files to store my data.
but the advantages are surprisingly not as great:

R> d <- read.csv( pipe( gzfile ) )   --> 5.7 seconds
R> d <- read.csv( pipe( gzfile ), colClasses="numeric")  --> 2.6 seconds
R> d <- read.csv( gzfile( gzfile ), colClasses="numeric")  --> 4.5 seconds  (surprisingly slower)

(the first and second versions are not using R's gzfile(), but
literally "gzcat .. |" in a pipe here.)

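the two ways of reading a .gz file can be sketched as follows.  this is
an illustration under assumptions, not the commands behind the timings
above: the synthetic data, the name gz_path, and the use of "gzip -dc"
(instead of gzcat) as the external decompressor are all made up for the
example, and the pipe variant is skipped if gzip is not on the PATH.

```r
# sketch: read one gzip-compressed csv via gzfile() and via an external pipe
set.seed(1)
d0 <- as.data.frame(matrix(rnorm(5000 * 10), ncol = 10))
gz_path <- tempfile(fileext = ".csv.gz")   # hypothetical file name

# write the csv through a gzfile connection, so no external tool is needed
con <- gzfile(gz_path, "w")
write.csv(d0, con, row.names = FALSE)
close(con)

# variant 1: R's own gzfile() connection
t_gzfile <- system.time(
  d_gzfile <- read.csv(gzfile(gz_path), colClasses = "numeric")
)

# variant 2: external decompressor feeding read.csv through pipe()
if (nzchar(Sys.which("gzip"))) {
  cmd <- paste("gzip -dc", shQuote(gz_path))
  t_pipe <- system.time(
    d_pipe <- read.csv(pipe(cmd), colClasses = "numeric")
  )
  print(c(gzfile = t_gzfile[["elapsed"]], pipe = t_pipe[["elapsed"]]))
}
```

both variants should return the same data frame; only the elapsed
times are expected to differ.
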
conclusion: a .gz file can be read from file to memory about four
times faster than a .bz2 file by the linux file system (outside R).
the conversion from strings in memory into R doubles takes about as
much time as the decompressing read of the .bz2 file.  bzip2 is a more
space-efficient storage method than .gz, but its decompression is
considerably slower (the fact that there is less to read from disk
does not make up for the CPU decompression overhead).

saving the data in native R format essentially has no decompression
penalty: reading it comes close to the raw speed of reading .gz data.
chances are this is because .gz support is baked in.  gzfile() does
not help with read.csv, however.
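
the post above times only gzcat on the .Rdata file, not load() itself.
a minimal sketch of how the load() side could be measured, assuming a
synthetic data frame and a hypothetical path rdata_path:

```r
# sketch: save a data frame in R's native gzip-compressed format, then time load()
set.seed(1)
d <- as.data.frame(matrix(rnorm(5000 * 10), ncol = 10))
rdata_path <- tempfile(fileext = ".Rdata")   # hypothetical file name
save(d, file = rdata_path, compress = "gzip")

d_saved <- d   # keep a copy to compare against after reloading
rm(d)

# load() restores the object under its original name, "d"
t_load <- system.time(load(rdata_path))
print(t_load[["elapsed"]])
```

load() does not parse text at all, so this measures decompression plus
deserialization rather than string-to-double conversion.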

/iaw
----
Ivo Welch (ivo.welch at gmail.com)

On Mon, Jun 10, 2013 at 10:09 AM, ivo welch <ivo.welch at gmail.com> wrote:
>> Surely you know the types of the columns?  If you specify it in advance,
>> read.table and relatives will be much faster.
>>
>> Duncan Murdoch
>
> thx, duncan.  yes, I do know the types of columns, but I did not
> realize how much faster these functions become.  on my SSD-based
> system, the speedup is about a factor of 2.  that is, read.csv on a
> bzip2 file that takes 10 seconds without colClasses takes 5 seconds
> with colClasses.  I don't know how to benchmark intermittent memory
> usage, but my guess is that with colClasses, it requires less memory,
> too.  in fact, my naive and incorrect assumption had been that
> read.csv would just read the file into a dynamic string array and then
> convert each string, and this would not take much longer than if it
> converted as it went along.  so, I had thought "more memory use but
> not more time."  wrong.
>
> I would add to the man (.Rd) page the sentence "Specifying colClasses
> can speed up read.csv" where it describes the option.
>
>
> once I figure out how to bake C into R, I may try to write a fast
> filter function for myself, and share it with others wanting to use it.
>
> regards,
>
> /iaw