[Rd] Slow IO: was [R] naive question

Vadim Ogranovich vograno at evafunds.com
Wed Jun 30 22:13:18 CEST 2004

> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk] 
> Sent: Wednesday, June 30, 2004 3:10 AM
> To: Vadim Ogranovich
> Cc: r-devel at stat.math.ethz.ch
> Subject: Re: [Rd] Slow IO: was [R] naive question
> "Vadim Ogranovich" <vograno at evafunds.com> writes:
> > ...
> > I can see at least two main reasons why R's IO is so slow (I didn't 
> > profile this though):
> > A) it reads from a connection char-by-char as opposed to 
> the buffered 
> > read. Reading each char requires a call to scanchar() which 
> then calls
> > Rconn_fgetc() (with some non-trivial overhead). 
> Rconn_fgetc() on its 
> > part is defined somewhere else (not in scan.c) and 
> therefore the call 
> > can not be inlined, etc.
> > B) mkChar, which is used very extensively, is too slow. 
> > ...
> Do you have some hard data on the relative importance of the 
> above issues?

Well, here is a little analysis which sheds some light. I have a file,
foo, 154M uncompressed. It contains about 3.8M lines:

01/02% ls -l foo*
-rw-rw-r--    1 vograno  man      153797513 Jun 30 11:56 foo
-rw-rw-r--    1 vograno  man      21518547 Jun 30 11:56 foo.gz

# reading the files using standard UNIX utils takes no time
01/02% time cat foo > /dev/null
0.030u 0.110s 0:00.80 17.5%	0+0k 0+0io 124pf+0w
01/02% time zcat foo.gz  > /dev/null
1.210u 0.030s 0:01.24 100.0%	0+0k 0+0io 90pf+0w

# compute exact line count
01/02% zcat foo.gz | wc
3794929 3794929 153797513

# now we fire R-1.8.1
# we will experiment with the gzip-ed copy since we've seen that the
overhead of decompression is trivial
> nlines <- 3794929

# this exercises scanchar(), but not mkChar(), see scan() in scan.c
> system.time(scan(gzfile("foo.gz", open="r"), what="character", skip =
nlines - 1))
Read 1 items
[1] 67.83  0.01 68.04  0.00  0.00

# this exercises both scanchar() and mkChar()
> system.time(readLines(gzfile("foo.gz", open="r"), n = nlines))
[1] 110.61   0.83 112.44   0.00   0.00

It seems that scanchar() and mkChar() have comparable overheads in this
case: roughly 68s of the 112s total elapses before mkChar() even enters
the picture.
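To make the buffering point concrete, here is a minimal sketch of what a
buffered replacement for the char-by-char path could look like. The names
(BufReader, buf_getc) are illustrative, not R's actual API, and a plain
FILE* stands in for an Rconnection; the point is only that the per-character
cost drops to an array access plus an occasional bulk refill, instead of a
function call through Rconn_fgetc() every time:

```c
#include <stdio.h>

/* Hypothetical buffered reader (names are illustrative, not R's API).
 * Instead of one Rconn_fgetc()-style call per character, we refill a
 * buffer in large chunks and serve characters from it. */
typedef struct {
    FILE *fp;               /* underlying stream; stands in for a connection */
    unsigned char buf[8192];
    int pos, len;           /* current offset and number of valid bytes */
} BufReader;

static int buf_getc(BufReader *r) {
    if (r->pos >= r->len) {                 /* buffer exhausted: refill */
        r->len = (int) fread(r->buf, 1, sizeof r->buf, r->fp);
        r->pos = 0;
        if (r->len <= 0) return EOF;
    }
    return r->buf[r->pos++];                /* common case: one array access */
}
```

The common-case path is branch-plus-increment, which the compiler can also
inline at the call site, addressing the "defined somewhere else" objection.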

> ... This might be a changing balance, but I 
> think you're more on the mark with the mkChar issue. (Then 
> again, it is quite a bit easier to come up with buffering 
> designs for Rconn_fgetc than it is to redefine STRSXP...)

First of all, I agree that redefining STRSXP is not easy, but it has the
potential to considerably speed up R as a whole, since name propagation
would work faster.
As to the mkChar() in scan(), there are a few tricks that can help. Say we
have a CSV file that contains categorical and numerical data. Here is
what we can do to minimize the number of calls to mkChar:

* when reading the file in as a bunch of lines (before type conversion),
do not call mkChar; rather, pre-allocate large temporary char * arrays
(via R_alloc) and store the lines sequentially in them. This allows us to
read the file into memory with just a few, albeit expensive, calls to
R_alloc. Here the arrays effectively serve as a heap that is released by
R at the end of the call.
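A rough sketch of that line heap, with malloc/realloc standing in for
R_alloc (the LineHeap name and offset-based interface are mine, not R's):
lines are copied end-to-end into one large block, so only the block is
allocated, never one object per line. Offsets rather than pointers are
returned so that stored lines survive a realloc of the block:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the pre-allocated line heap (malloc stands in for R_alloc). */
typedef struct {
    char *heap;           /* one big allocation holding all line bytes */
    size_t used, cap;
} LineHeap;

/* Store a line; returns its offset. Resolve later with h->heap + off,
 * so references stay valid even if the block is moved by realloc. */
static size_t heap_store(LineHeap *h, const char *line, size_t n) {
    if (h->used + n + 1 > h->cap) {          /* grow geometrically */
        size_t newcap = h->cap ? h->cap : 4096;
        while (h->used + n + 1 > newcap) newcap *= 2;
        char *p = realloc(h->heap, newcap);
        if (!p) abort();                     /* sketch: no error recovery */
        h->heap = p;
        h->cap = newcap;
    }
    size_t off = h->used;
    memcpy(h->heap + off, line, n);
    h->heap[off + n] = '\0';
    h->used += n + 1;
    return off;
}
```

With R_alloc the final free is implicit: the vmaxset/vmaxget machinery
reclaims everything when scan() returns.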

* Field conversion
	- when converting numeric fields there is no need to call mkChar
at all (obvious)
	- when creating char fields that correspond to categorical data
(going from the first element to the last), we can maintain a hash table
that maps each field value encountered so far, char* -> SEXP. When we get
a new field value we first look it up in the hash table, and if it is
already there we reuse the corresponding SEXP when assigning to the
string element. This leads to a considerable speed-up in the common case
where most field values are drawn from a small (<1000) set of "factor levels".
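The interning idea can be sketched with a tiny open-addressing table. Here
a canonical char* copy stands in for the shared SEXP (in R it would be the
CHARSXP returned by the one mkChar call per distinct level); the table
shape and names are my own illustration, not scan()'s actual code:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of factor-level interning: map a string to one canonical copy,
 * so repeated field values reuse a single object (one mkChar per level). */
#define TAB_SIZE 2048                 /* power of two; ample for <1000 levels */

static const char *slots[TAB_SIZE];

static unsigned long str_hash(const char *s) {
    unsigned long h = 5381;           /* djb2 */
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h;
}

static char *dup_str(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    memcpy(p, s, n);
    return p;
}

/* Return the canonical copy of s, inserting it on first sight. */
static const char *intern(const char *s) {
    unsigned long i = str_hash(s) & (TAB_SIZE - 1);
    while (slots[i]) {
        if (strcmp(slots[i], s) == 0) return slots[i];  /* cache hit */
        i = (i + 1) & (TAB_SIZE - 1);                   /* linear probe */
    }
    slots[i] = dup_str(s);            /* first sight: the one "mkChar" call */
    return slots[i];
}
```

After the first occurrence of each level, every later occurrence costs a
hash lookup instead of an allocation.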

And a final observation while we are on the subject of scan(). I've found
it more convenient to convert data column-by-column rather than
row-by-row. When you do it column-by-column you:
* figure out the type of the column only once. Ditto for the destination
vector.
* maintain only one hash table, for the current column, rather than one
per column all at once.
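The column-wise payoff can be sketched as follows (the function name and
interface are illustrative): the target type is decided once per column,
and then every cell is converted with the same unconditional path, with no
per-row type dispatch:

```c
#include <stdlib.h>

/* Sketch of column-wise conversion: decide the type once, then run the
 * same tight loop (strtod here) over every cell in the column. */
static int convert_numeric_column(const char **cells, int n, double *out) {
    for (int i = 0; i < n; i++) {
        char *end;
        out[i] = strtod(cells[i], &end);
        if (end == cells[i] || *end != '\0')
            return 0;                 /* column is not numeric after all */
    }
    return 1;
}
```

A row-wise reader would have to re-dispatch on the column type for every
field of every row; here that decision is hoisted out of the loop entirely.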

