[R] write.table very slow

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Dec 23 10:08:00 CET 2001


[Sorry for the belated reply: we've been busy getting 1.4.0 out.]

On Thu, 6 Dec 2001, Ott Toomet wrote:

> Hi,
>
> I think the problem lies in the code of write.table().  It is essentially a
> paste() function, which pastes all the data in the table into a long
> character string and thereafter writes the string into file.  I was not able
> to write a dataset of 7500 obs times 1200 variables at all, I had to split
> it up into smaller units and write those separately.  In addition, it caused
> much swapping on my 128MB system.

R has profiling, and if you run profiling you will see this is just not
true.  I ran tests of 600 columns.  1/3 of the time was spent adding
quotes, and 1/2 of the time on finding the NAs!
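A check of this kind can be sketched with R's sampling profiler, Rprof(),
and summaryRprof().  The data below are made up for illustration (80 rows,
600 columns, a few NAs), and the breakdown on a current R will probably not
match the 2001 figures quoted above, so treat it as a way to look for
yourself rather than a reproduction.

```r
# Hypothetical test data: 80 rows x 600 numeric columns with some NAs.
set.seed(1)
m <- matrix(rnorm(80 * 600), nrow = 80)
m[sample(length(m), 100)] <- NA
df <- as.data.frame(m)

prof_file <- tempfile()
out_file  <- tempfile()

Rprof(prof_file, interval = 0.005)          # start the sampling profiler
for (i in 1:20) write.table(df, out_file)   # repeat so the sampler gets hits
Rprof(NULL)                                 # stop profiling

# summaryRprof() fails if no samples were recorded, so guard the call.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self, 10))  # time spent per function
```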

I do wonder if write.table is the appropriate tool.  It is designed for
data frames, and each of those 6000 columns could in principle be a
different class of R object.  So almost inevitably it is going to be slow
even for 1 row.  If what you really have is a matrix, write.matrix in
package MASS is about 10x faster.  (It also uses format and so makes a
better job of the output for numeric matrices.)
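A rough way to compare the two on a plain numeric matrix is sketched below.
The matrix size is an arbitrary choice and the "10x" above is a 2001
measurement, so the ratio you see may well differ.

```r
library(MASS)   # provides write.matrix()

# Hypothetical numeric matrix, 80 rows x 600 columns.
set.seed(1)
m <- matrix(rnorm(80 * 600), nrow = 80)

f_table  <- tempfile()
f_matrix <- tempfile()

# write.table() with quoting and dimnames switched off, for a fairer comparison.
t_table <- system.time(
  write.table(m, f_table, quote = FALSE, row.names = FALSE, col.names = FALSE)
)

# write.matrix() formats the whole matrix with format() and writes it via cat().
t_matrix <- system.time(write.matrix(m, f_matrix))

print(rbind(write.table = t_table, write.matrix = t_matrix)[, "elapsed"])
```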

As for 7500 x 1200, that would benefit by being written in blocks of rows.
However, unfortunately that's not in general possible for a data frame, as
the conversion to a character matrix (in one of as.matrix, quoting or
paste) may well depend on all the rows.  (Think about the equivalent of
printing to `digits' significant places.)  And one has no control over
as.character methods for all possible constituents of data frames.
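A small illustration of why the conversion can depend on all the rows:
format() picks a common number of decimals for the whole vector, so the
same value can be rendered differently depending on what else is present.

```r
# Formatted on its own, 1 needs no decimals.
alone <- format(1)

# Formatted together with 2.5, it is padded to a common representation.
together <- format(c(1, 2.5))

print(alone)      # "1"
print(together)   # "1.0" "2.5"
```

So a block of rows formatted separately need not match the output produced
when the whole column is formatted at once.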

In any case, as the dataset is about 70Mb, you are inevitably going to
have problems in R on a 128Mb system, and writing a data file that size
will be slow on many file systems. I failed on a 512Mb system too.  But
writing 750 rows at a time was quite feasible (20 secs with write.matrix,
125 secs with write.table, on a 1GHz machine).  Just conversion to
character took 200 secs and 350Mb for the whole matrix.

It would be worth producing a blocked version of write.matrix, but it
seems not possible to do much about write.table without reducing its
generality, which is needed for some smaller problems.
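For a plain numeric matrix, writing in blocks of rows can be sketched as
follows.  write_in_blocks() and its block_rows argument are hypothetical
names introduced here, not part of R or MASS, and the approach is only safe
when the conversion of each element does not depend on the other rows; it
is not a general replacement for write.table() on data frames.

```r
# Write a matrix to 'file' a block of rows at a time, so that only one
# block is converted to character at any moment.
write_in_blocks <- function(x, file, block_rows = 750) {
  stopifnot(is.matrix(x), nrow(x) > 0, block_rows >= 1)
  con <- file(file, open = "w")
  on.exit(close(con))
  starts <- seq(1, nrow(x), by = block_rows)
  for (s in starts) {
    e <- min(s + block_rows - 1, nrow(x))
    # drop = FALSE keeps a one-row block as a matrix
    write.table(x[s:e, , drop = FALSE], con,
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
  invisible(length(starts))   # number of blocks written
}

# Usage on made-up data: 2000 rows in blocks of 750 gives 3 blocks.
set.seed(1)
m <- matrix(rnorm(2000 * 50), nrow = 2000)
blocked_file <- tempfile()
n_blocks <- write_in_blocks(m, blocked_file, block_rows = 750)
```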

> I think (I have not tried) it could work faster in your case if you just
> save the observations separately into separate files and thereafter merge the
> files (but it is worth doing only if you have to write the table
> repeatedly, of course).  In the long run I think a rewrite of write.table()
> in C in such a way that it does not store the whole file in memory may be a
> solution.

(Not in principle feasible.)

> Regards,
>
> Ott Toomet
> -------------------------------
> On Wed, 5 Dec 2001, Cole Harris wrote:
>
> > When writing tables with a large number of columns, write.table() seems
> > to take way too much time - e.g. a table with ~80 rows and ~6000 columns
> > takes ~30 min cpu on my 900 MHz pc.
> > I would appreciate any explanations or advice.
> >
> > Thanks,
> > Cole

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
