[R] naive question

Fri Jul 2 02:22:21 CEST 2004

As part of a continuing thread on the cost of loading large
amounts of data into R,

"Vadim Ogranovich" <vograno at evafunds.com> wrote:
	R's IO is indeed 20 - 50 times slower than that of equivalent C code
	no matter what you do, which has been a pain for some of us.

I wondered to myself just how bad R is at reading,
when it is given a fair chance.  So I performed an experiment.
My machine (according to "Workstation Info") is a SunBlade 100 with 640MB
of physical memory running SunOS 5.9 Generic.  According to fpversion this
is an Ultra2e with the CPU clock running at 500MHz and the main memory
clock running at 84MHz (wow, slow memory).  R.version is
platform sparc-sun-solaris2.9
arch     sparc               
os       solaris2.9          
system   sparc, solaris2.9   
status                       
major    1                   
minor    9.0                 
year     2004                
month    04                  
day      12                  
language R                   
and although this is a 64-bit machine, it's a 32-bit installation of R.

The experiment was this:
(1) I wrote a C program that generated 12500 rows of 800 columns; the
    numbers were integers 0..999,999,999 generated using drand48().
    These numbers were written using printf().  It is possible to do
    quite a bit better by avoiding printf(), but that would ruin the
    spirit of the comparison, which is to see what can be done with
    *straightforward* code using *existing* library functions.

    21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.

    The sizes were chosen to get 100MB; the actual size was
    12500 (lines) 10000000 (words) 100012500 (bytes)

(2) I wrote a C program that read these numbers using scanf("%d"); it
    "knew" there were 800 numbers per row and 12500 rows in all.
    Again, it is possible to do better by avoiding scanf(), but the
    point is to look at *straightforward* code.

    18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.

(3) I started R, played around a bit doing other things, then issued this
    command:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
    + row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
    + comment.char=""))

    So how long _did_ it take to read 100MB on this machine?

    71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.

The result:  the R/C ratio was less than 4, whether you measure cpu time
or real time.  It certainly wasn't anywhere near 20-50 times slower.
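
For concreteness, steps (1) and (2) might look something like the sketch
below.  This is not the code that was timed above: the function names
write_text() and read_text(), the fixed seed, the single-space separator
and the returned checksum are all assumptions made for illustration.  The
only things taken from the description are drand48() for the numbers,
printf-style output, and scanf("%d")-style input with the dimensions known
in advance.

```c
/* Text I/O sketch: printf-family writer, scanf-family reader. */
#define _XOPEN_SOURCE 600   /* exposes drand48()/srand48() on strict compilers */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write rows x cols integers in 0..999999999 as text, one row per line.
   Returns the sum of the values written so a reader can be checked. */
long long write_text(const char *path, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    FILE *fp = fopen(path, "w");
    if (fp == NULL) { perror(path); exit(EXIT_FAILURE); }

    long long sum = 0;
    srand48(42);                            /* arbitrary fixed seed (assumption) */
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int v = (int)(drand48() * 1e9); /* drand48() is in [0,1) */
            sum += v;
            /* space between columns, newline after the last one */
            if (fprintf(fp, "%d%c", v, c == cols - 1 ? '\n' : ' ') < 0) {
                perror(path);
                exit(EXIT_FAILURE);
            }
        }
    }
    if (fclose(fp) != 0) { perror(path); exit(EXIT_FAILURE); }
    return sum;
}

/* Read back exactly rows x cols integers with fscanf("%d").
   Returns their sum. */
long long read_text(const char *path, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    FILE *fp = fopen(path, "r");
    if (fp == NULL) { perror(path); exit(EXIT_FAILURE); }

    long long sum = 0;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int v;
            if (fscanf(fp, "%d", &v) != 1) {
                fprintf(stderr, "%s: short or malformed input\n", path);
                exit(EXIT_FAILURE);
            }
            sum += v;
        }
    }
    fclose(fp);
    return sum;
}
```

Calling write_text("/tmp/big.dat", 12500, 800) from one small main() and
read_text() with the same arguments from another, each run under time(1),
would be the analogue of steps (1) and (2); the two sums should agree.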

Of course, *binary* I/O in C *would* be quite a bit faster:
(1') generate same integers but write a row at a time using fwrite():
     5 seconds cpu, 25 seconds real; 40 MB.

(2') read same integers a row at a time using fread()
     0.26 seconds cpu, 1 second real.

This would appear to more than justify "20-50 times slower", but reading
binary data and reading data in a textual representation are different
things; "less than 4 times slower" is the fairer measure.  However, it
does emphasise the usefulness of problem-specific bulk reading techniques.
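
A row-at-a-time binary version along the lines of (1') and (2') could be
sketched as follows.  Again this is an illustration, not the program that
was measured: the names write_binary() and read_binary(), the seed and the
checksum are assumptions.  The file holds raw native ints, so it is only
readable on a machine with the same int size and byte order; with 4-byte
ints, 12500 x 800 values come to the 40 MB quoted above.

```c
/* Binary I/O sketch: one fwrite()/fread() call per row of ints. */
#define _XOPEN_SOURCE 600   /* exposes drand48()/srand48() on strict compilers */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write rows x cols native ints, one fwrite() per row.
   Returns the sum of the values written. */
long long write_binary(const char *path, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    int *row = malloc((size_t)cols * sizeof *row);
    FILE *fp = fopen(path, "wb");
    if (row == NULL || fp == NULL) { perror(path); exit(EXIT_FAILURE); }

    long long sum = 0;
    srand48(42);                            /* arbitrary fixed seed (assumption) */
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            row[c] = (int)(drand48() * 1e9);
            sum += row[c];
        }
        /* fwrite() returns the number of items written */
        if (fwrite(row, sizeof *row, (size_t)cols, fp) != (size_t)cols) {
            perror(path);
            exit(EXIT_FAILURE);
        }
    }
    if (fclose(fp) != 0) { perror(path); exit(EXIT_FAILURE); }
    free(row);
    return sum;
}

/* Read rows x cols native ints, one fread() per row.
   Returns their sum. */
long long read_binary(const char *path, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    int *row = malloc((size_t)cols * sizeof *row);
    FILE *fp = fopen(path, "rb");
    if (row == NULL || fp == NULL) { perror(path); exit(EXIT_FAILURE); }

    long long sum = 0;
    for (int r = 0; r < rows; r++) {
        if (fread(row, sizeof *row, (size_t)cols, fp) != (size_t)cols) {
            fprintf(stderr, "%s: short read\n", path);
            exit(EXIT_FAILURE);
        }
        for (int c = 0; c < cols; c++)
            sum += row[c];
    }
    fclose(fp);
    free(row);
    return sum;
}
```

The reader does no parsing at all, only a copy into the row buffer, which
is where the large gap between (2) and (2') comes from.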

I thought I'd give you another R measurement:
> system.time(xx <- read.table("/tmp/big.dat", header=FALSE))
But I got sick of waiting for it, and killed it after 843 cpu seconds,
3075 real seconds.  Without knowing how far it had got, one can say no
more than that this is at least 10 times slower in cpu time than the
more informed call to read.table.

What this tells me is that if you know something about the data that
you _could_ tell read.table about, you do yourself no favour by keeping
read.table in the dark.  All those options are there for a reason, and
it *will* pay to use them.