[Rd] Slow IO: was [R] naive question

Peter Dalgaard p.dalgaard at biostat.ku.dk
Wed Jun 30 12:09:42 CEST 2004


"Vadim Ogranovich" <vograno at evafunds.com> writes:

> I believe IO in R is slow because of the way it is implemented, not
> because it has to do some extra work for the user. 
> 
> I compared scan() with the 'what' argument set (which, AFAIK, is the
> fastest way to read a CSV file) to equivalent C code. scan() turned
> out to be 20 to 50 times slower.
> I can see at least two main reasons why R's IO is so slow (I didn't
> profile this though):
> A) it reads from a connection char-by-char rather than doing buffered
> reads. Reading each char requires a call to scanchar(), which then calls
> Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc(), for its
> part, is defined somewhere else (not in scan.c), and therefore the call
> cannot be inlined, etc.
> B) mkChar, which is used very extensively, is too slow. There are ways
> to minimize the number of calls to mkChar, but I won't expand on it in
> this message.
> 
> I brought this up because it seems that many people believe that the
> slowness is inherent and is a tradeoff for something else. I don't think
> this is the case.

Do you have some hard data on the relative importance of the above
issues?

I wouldn't think that R is really unbuffered, since there is buffering
underlying the various fgetc() variants. Most C programs will do
char-by-char processing by the same definition. The lack of inlining
is sort of a consequence of a design where Rconn_fgetc() is
switchable. However, conventional wisdom is that all of this tends to
be drowned out by the cost of disk I/O. The balance may be shifting,
but I think you're more on the mark with the mkChar issue. (Then again,
it is quite a bit easier to come up with buffering designs for
Rconn_fgetc than it is to redefine STRSXP...)
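
A minimal sketch, in C, of what one such buffering design might look
like. None of this is R's actual connection code: conn, bufconn,
mem_read, buf_getc and count_char are hypothetical names made up for
illustration, and the in-memory source merely stands in for a real
connection. The idea is to keep the switchable (function-pointer) read
method, but call it once per block and serve single characters from a
local buffer through a small function the compiler is free to inline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a switchable connection. The read method is a
   function pointer, so a call through it cannot be inlined. */
typedef struct conn {
    size_t (*read)(struct conn *self, char *dst, size_t n); /* block read */
    const char *data;    /* in-memory source, used only for this sketch */
    size_t len, pos;
    unsigned long calls; /* number of indirect read calls made so far */
} conn;

/* Block read from the in-memory source; returns the bytes copied (0 at end). */
static size_t mem_read(conn *c, char *dst, size_t n)
{
    size_t left = c->len - c->pos;
    if (n > left) n = left;
    memcpy(dst, c->data + c->pos, n);
    c->pos += n;
    c->calls++;
    return n;
}

static void conn_open_mem(conn *c, const char *s)
{
    c->read = mem_read;
    c->data = s;
    c->len = strlen(s);
    c->pos = 0;
    c->calls = 0;
}

/* Buffer layer sitting in front of the switchable read method. */
#define BUF_SIZE 4096
typedef struct {
    conn *c;
    char buf[BUF_SIZE];
    size_t pos, end;     /* next unread byte, and one past the last valid byte */
} bufconn;

static void buf_init(bufconn *b, conn *c)
{
    assert(b != NULL && c != NULL);
    b->c = c;
    b->pos = b->end = 0;
}

/* Next byte as an unsigned char value, or -1 at end of input. The indirect
   call happens only when the buffer is empty, not once per character. */
static inline int buf_getc(bufconn *b)
{
    if (b->pos == b->end) {
        b->end = b->c->read(b->c, b->buf, BUF_SIZE);
        b->pos = 0;
        if (b->end == 0) return -1;
    }
    return (unsigned char) b->buf[b->pos++];
}

/* Example consumer: count occurrences of ch (e.g. field separators). */
static size_t count_char(bufconn *b, int ch)
{
    size_t n = 0;
    int c;
    while ((c = buf_getc(b)) != -1)
        if (c == ch) n++;
    return n;
}
```

With the 12-character input "1,2,3\n4,5,6\n", count_char(&b, ',')
consumes every character, yet the indirect read method is called only
twice: once to fill the buffer and once to detect end of input.
Whether this buys much in practice depends on how the per-character
call overhead compares with disk I/O and mkChar costs, which would
need measuring.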
 
-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907


