[R] write.table() performance.

Carlos J. Gil Bellosta sigma at consultoresestadisticos.com
Thu Jul 1 19:36:36 CEST 2004


Dear r-helpers,

I know that there have already been enough questions on I/O performance 
these last few days, but I came across the following situation today. I 
was comparing the performance of R with that of SAS's Risk Dimensions at 
generating random "scenarios". My dataset --all numeric entries-- would 
fit nicely into RAM and R would outperform SAS until... I wanted to 
export the results to a .csv file using the write.table() function. For 
reference, the output file was about 30 MB. Moreover, the memory 
needed by R would increase sharply during the writing process.

I had a look at the code for the write.table() function and I found out 
that, basically, what it does is to create a very long text string from 
the data using paste() and then to print it using writeLines(). Rprof() 
showed that writeLines() would only use a mere 3% of the computing time, 
the rest being taken almost entirely by paste().
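
The two-step pattern described above can be sketched roughly as follows. 
This is only an illustration of the idea, not the actual source of 
write.table(); the data frame `scenarios`, its size, and the temporary 
output file are assumptions made up for the example.

```r
# Hypothetical stand-in for the real data: an all-numeric data frame.
set.seed(1)
scenarios <- as.data.frame(matrix(rnorm(2000 * 5), ncol = 5))
out_file <- tempfile(fileext = ".csv")

# Step 1: build one text line per row with paste().
# This is the step the profile pointed at.
t_paste <- system.time(
  rows <- do.call(paste, c(unname(as.list(scenarios)), sep = ","))
)

# Step 2: write the header and the pre-built lines with writeLines().
t_write <- system.time(
  writeLines(c(paste(names(scenarios), collapse = ","), rows), out_file)
)

# Compare the elapsed time of the two steps.
print(rbind(paste = t_paste, writeLines = t_write)[, "elapsed"])
```

Timing the two steps separately with system.time() is a cheap way to 
check where the time goes before reaching for Rprof().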

There are two directions in which performance could potentially be improved:

1.- Writing speed.
2.- Memory usage.

Regarding memory usage, I thought that perhaps a little rewriting of the 
write.table() function could be considered: instead of building a 
single long text string in RAM, the data frame to be printed could, with 
a little overhead, be split into shorter chunks, each of which would be 
paste()-d into a short "buffer" string and written sequentially to the 
output file, so that the buffer memory could be recycled. (Note: I am 
completely ignorant of R's memory recycling rules, and this might not 
work as intended because of them.)
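
A minimal sketch of the chunked approach might look like this. The 
function name `write_chunked` and the `chunk_rows` parameter are 
hypothetical, and the sketch assumes all columns are numeric (no quoting 
or escaping is done). Whether it actually lowers peak memory depends on 
when R reclaims the intermediate strings, which is not checked here.

```r
# Hypothetical helper: write a data frame as CSV in row chunks, so that
# only one chunk's worth of pasted text exists at a time.
write_chunked <- function(df, file, chunk_rows = 10000L, sep = ",") {
  con <- file(file, open = "wt")
  on.exit(close(con))

  # Header line.
  writeLines(paste(names(df), collapse = sep), con)

  n <- nrow(df)
  starts <- if (n > 0L) seq(1L, n, by = chunk_rows) else integer(0)
  for (s in starts) {
    idx <- s:min(s + chunk_rows - 1L, n)
    # paste() only the rows of the current chunk, then write them out.
    chunk <- unname(as.list(df[idx, , drop = FALSE]))
    writeLines(do.call(paste, c(chunk, sep = sep)), con)
  }
  invisible(n)
}

# Usage with made-up data: 2500 rows written in chunks of 1000.
set.seed(2)
df <- as.data.frame(matrix(runif(2500 * 3), ncol = 3))
chunked_file <- tempfile(fileext = ".csv")
write_chunked(df, chunked_file, chunk_rows = 1000L)
```

Keeping a single connection open across chunks avoids reopening the 
file for every write.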

Regarding speed, I see little hope as long as the paste() function is 
implicitly called by write.table(). Most likely, its execution time 
scales linearly with the number of lines in the data frame, so splitting 
it would yield no benefit. Are there any hints on how a performance 
improvement (other than by linking external, ad hoc C code) could be 
achieved? Do we really need to go through paste()? Would it perhaps be 
beneficial to include in R some specialized functions that achieve high 
output performance for writing out, say, only numeric values (this 
happens to be my case most of the time)?
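
For the numeric-only case, one base R route that does not go through 
paste() for the data is write(), which hands the values to cat(). The 
sketch below assumes the data can be held as a numeric matrix; the 
matrix `m` and its column names are made up for the example. Note that 
cat() prints about 7 significant digits by default, so this trades 
precision for simplicity, and no claim is made here about how its speed 
compares with write.table().

```r
# Hypothetical all-numeric data held as a matrix.
set.seed(3)
m <- matrix(rnorm(1000 * 4), ncol = 4,
            dimnames = list(NULL, paste0("x", 1:4)))
num_file <- tempfile(fileext = ".csv")

# Header line first.
writeLines(paste(colnames(m), collapse = ","), num_file)

# write() emits values in column-major order, so transpose the matrix
# to get one output line per original row.
write(t(m), file = num_file, ncolumns = ncol(m), append = TRUE, sep = ",")
```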

Sorry for the long posting.

Carlos J. Gil Bellosta



