[R] package for saving large datasets in ASCII

Ott Toomet siim at obs.ee
Sun Aug 11 14:51:33 CEST 2002


Hi,

I am continuing the discussion about saving dataframes in ASCII.

I have not overlooked the blocksize argument of write.matrix(), but
what is a sensible size?  I assumed that blocksize=1 is the most
memory-efficient, but (on a smaller example) I experimented with
different sizes.  Initially the speed increased slightly, but from a
value of around 5 it seemed to stay constant or even decrease.
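
For anyone who wants to repeat the blocksize experiment, here is a rough
sketch of the kind of timing loop I mean.  The test data and the list of
block sizes are made up for illustration; only write.matrix() (from MASS)
and system.time() are real functions, and the timings will of course
depend on the machine.

```r
library(MASS)

# Made-up test data: 200 rows, 50 numeric columns.
x <- as.data.frame(matrix(rnorm(200 * 50), nrow = 200))

# Time write.matrix() for a few (arbitrarily chosen) block sizes.
block_sizes <- c(1, 5, 20, 100)
timings <- sapply(block_sizes, function(bs) {
  f <- tempfile()
  on.exit(unlink(f))  # remove the temporary file afterwards
  system.time(write.matrix(x, f, sep = ",", blocksize = bs))[["elapsed"]]
})

# One elapsed time (in seconds) per block size.
print(data.frame(blocksize = block_sizes, elapsed = timings))
```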

The problem for me is not the speed for small dataframes but the fact
that I was not able to save a large dataframe at all.  I think the
reason is associated with the first line of write.matrix() which is

    x <- as.matrix(x)

This converts the whole dataframe into a new ASCII (character) matrix, a
process which is both slow and memory-consuming if the original object is
large.  The second place I am not sure about is the lines

            cat(format(t(x[nlines + (1:nb), ])), file = file, 
                append = TRUE, sep = c(rep(sep, p - 1), "\n"))

Isn't t(x[...]) creating new temporary objects?

Or have I misunderstood something?
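
To make the concern concrete, here is a minimal sketch of what I would
expect a memory-friendly writer to do: take a few rows of the dataframe
at a time and write them out, without ever converting the whole object.
The function write_in_blocks() and its arguments are hypothetical (it is
neither write.matrix() nor my save.table()), and each block is still
formatted as strings by write.table(), only for blocksize rows at a time.

```r
# Hypothetical helper, not part of MASS or savetable.
write_in_blocks <- function(df, file, sep = ",", blocksize = 100) {
  con <- file(file, open = "w")
  on.exit(close(con))

  # Header line with the column names.
  writeLines(paste(names(df), collapse = sep), con)

  n <- nrow(df)
  start <- 1
  while (start <= n) {
    end <- min(start + blocksize - 1, n)
    # Only this block of rows is extracted and formatted.
    block <- df[start:end, , drop = FALSE]
    write.table(block, con, sep = sep, row.names = FALSE,
                col.names = FALSE, quote = FALSE)
    start <- end + 1
  }
  invisible(n)
}

# Usage on a small made-up dataframe.
df <- data.frame(a = 1:5, b = c("x", "y", "z", "u", "v"))
f <- tempfile()
write_in_blocks(df, f, blocksize = 2)
back <- read.csv(f)
```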

BTW, are there any ways to check memory consumption of individual
objects and functions?
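
The closest I have found so far is the sketch below, which uses
object.size() for individual objects and gc() for overall usage.  The
example data are made up, and I do not know of a simple per-function
figure beyond comparing gc() output before and after a call.

```r
# Made-up example data.
x <- data.frame(id = 1:1000, value = rnorm(1000))

# Approximate size in bytes of the dataframe and of its matrix copy.
size_df  <- object.size(x)
size_mat <- object.size(as.matrix(x))
print(size_df)
print(size_mat)

# Run a garbage collection and report overall memory usage.
mem <- gc()
print(mem)
```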

Best wishes,

Ott

On Sat, 10 Aug 2002 ripley at stats.ox.ac.uk wrote:

  |?write.matrix  will tell you what you have overlooked, a sensible
  |blocksize.
  |
  |If `I am not sure about write.matrix()', surely reading the help page is a
  |first step?
  |
  |On Sat, 10 Aug 2002, Ott Toomet wrote:
  |
  |> Hi,
  |>
  |> I have made a tiny package for saving dataframes in ASCII format.  The
  |> package contains functions save.table() and save.delim(), the first
  |> mimics (not completely) write.table() and the second uses just
  |> different default values, suitable for read.delim().
  |>
  |> The reason I have written the functions is that I have had problems
  |> with saving large dataframes in ASCII form.  write.table() essentially
  |> makes a huge string in memory from the dataframe.  I am not sure about
  |> write.matrix() (in MASS), but in my experience it is also too
  |> memory-intensive.  My approach was to write the whole thing in C
  |> in such a way that the function takes the values from the dataframe,
  |> one scalar value at a time, and writes them immediately to the file.
  |> This, of course, puts certain limitations on the contents of the
  |> dataframe and the output format.
  |>
  |> Here is an example of the result:
  |>
  |> > dim(e2000)
  |> [1] 7505 1197
  |> > library(savetable)
  |> > system.time(save.table(e2000, "e2000"))
  |> [1] 38.04  0.48 48.75  0.00  0.00
  |> > library(MASS)
  |> > system.time(write.matrix(e2000, "e2000", sep=",", 1))
  |>
  |>  -- killed after 10 minutes swapping.
  |>
  |> And now a smaller example:
  |>
  |> > dim(e2000s)
  |> [1]  100 1197
  |> > library(savetable)
  |> > system.time(save.table(e2000s, "e2000s"))
  |> [1] 0.45 0.00 0.56 0.00 0.00
  |> > system.time(write.table(e2000s, "e2000s"))
  |> [1] 31.21  0.11 38.99  0.00  0.00
  |> > library(MASS)
  |> > system.time(write.matrix(e2000s, "e2000s", sep=",", 1))
  |> [1] 4.01 0.66 5.45 0.00 0.00
  |>
  |> None of the functions started swapping this time, but as you can see,
  |> save.table() is still around 10 times as fast as write.matrix().
  |> The examples were run on my 128MB PII-400 Linux system with R 1.4.0.
  |>
  |> I am not sure if there is much interest in such a package, so I put
  |> it on my own website instead of CRAN
  |> (http://www.obs.ee/~siim/savetable_0.1.0.tar.gz).  Any feedback is
  |> appreciated.
  |>
  |> Many thanks to Brian Ripley and the others who helped me with
  |> accessing R objects in C.
  |>
  |>
  |> Best wishes,
  |>
  |> Ott Toomet
