Re: [R] package for saving large datasets in ASCII

ripley at stats.ox.ac.uk
Sun Aug 11 15:53:25 CEST 2002


The sort of `large' here is 7500 x 1200.  That's 72Mb as real numbers
(8 bytes each), so let's assume you have at least 256Mb to use.  I ran
the following on
Windows with a 256Mb limit (and I had to use R-devel to do so). I actually
found it difficult to create a data frame of that size in 256Mb, and
resorted to

A1 <- vector("list", 1000)               # pre-allocate a list of 1000 columns
for(i in 1:1000) A1[[i]] <- rnorm(8000)  # fill each column with 8000 normals
class(A1) <- "data.frame"                # stamp the class on directly: no copies
row.names(A1) <- 1:8000                  # a data frame must have row names

which took 15 secs and 140Mb as an underhand way to make a data frame.
(1.5.1 took too much memory here.)
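
For contrast, the obvious direct construction (a reconstruction for
illustration, not something run in this test) would have been

A1 <- as.data.frame(matrix(rnorm(8000 * 1000), ncol = 1000))

but matrix() and as.data.frame() each make at least one further full
copy of the 64Mb of data, which is why that route is hard to fit in a
256Mb limit.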

Then

A2 <- as.matrix(A1)

took 1.8 secs (hardly slow) and an additional 64Mb to hold the object A2.
I then deleted A1.  Running

library(MASS)   # write.matrix() comes from package MASS
write.matrix(A2, "foo.dat", blocksize = 1000)

used about 150Mb in about four minutes.  That is formatting 8 million
numbers, and 85% of the time was spent in the system calls, as one should
expect.  (I suspect I did not need to delete A1, but didn't want to wait
around to find out.)

So

1) you could have checked your claims by some simple experiments.

2) as claimed, write.matrix does indeed do the job.

On Sun, 11 Aug 2002, Ott Toomet wrote:

> I am continuing the discussion about data frames in ASCII.
>
> I have not overlooked the blocksize argument of write.matrix(), but
> what is a sensible size?  I assumed that blocksize=1 is the most
> memory-efficient, but I experimented with different sizes (on a
> smaller example).  Initially, speed increased slightly, but from
> around a value of 5 it seemed to be constant or even to decrease.

A few hundred, probably.

Why did you assume that blocksize=1 was best?  R is a vector language, and
it is normally best to use the largest blocks that you can fit in memory.
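
A minimal illustration of the vector-language point (a made-up example,
not from the original mail; the exact timings are machine-dependent, but
the single call wins by a wide margin):

x <- matrix(rnorm(1e5), ncol = 10)
system.time(for(i in 1:nrow(x)) format(x[i, ]))  # 10000 tiny calls
system.time(format(x))                           # one vectorized call

The per-row loop pays the interpreter and function-call overhead 10000
times over; the single call pays it once.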

> The problem for me is not the speed for small data frames but the
> fact that I was not able to save a large data frame at all.  I think
> the reason is associated with the first line of write.matrix(), which
> is
>
>     x <- as.matrix(x)
>
> This converts the whole data frame into a new ASCII matrix, a process
> which

Not if it is a matrix: what's the function name?  For a general data frame
there really is no choice but to convert each column as a whole.
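
A quick check of that point (a generic illustration, not from the
original exchange):

m <- matrix(1:6, nrow = 2)
identical(as.matrix(m), m)   # TRUE: a matrix is returned unchanged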

> is both slow and memory-consuming if the original object is large.

False: see above.

> The
> second place I am not sure about is these lines
>
>             cat(format(t(x[nlines + (1:nb), ])), file = file,
>                 append = TRUE, sep = c(rep(sep, p - 1), "\n"))
>
> isn't t(x[...]) creating new temporary objects?

Yes (and so does the format call), but there is garbage collection.
That's one reason why a blocksize of 1 is not at all sensible: it forces
the loop to be run thousands of times.  Just choose blocksize to keep
this step within your memory limits.
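
To make that concrete, here is a simplified sketch of blocked writing
(write_blocks is a made-up name; this is not the actual MASS source,
which uses cat() as quoted above).  Each pass formats only blocksize
rows, so the temporaries are proportional to blocksize * ncol(x) and are
reclaimed by the garbage collector before the next pass:

write_blocks <- function(x, file, blocksize = 1000, sep = " ") {
    con <- file(file, open = "wt")
    on.exit(close(con))
    nr <- nrow(x)
    nlines <- 0
    while (nlines < nr) {
        nb <- min(blocksize, nr - nlines)
        block <- x[nlines + seq_len(nb), , drop = FALSE]
        ## format the block and join each row into one output line; these
        ## temporaries are garbage-collected after every iteration
        writeLines(apply(format(block), 1, paste, collapse = sep), con)
        nlines <- nlines + nb
    }
}

With blocksize = 1 the loop body, and hence the interpreter overhead,
runs once per row; with blocksize = 1000 it runs eight times for the
8000-row example above.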

> Or have I misunderstood something?

Your memory size?  I suggest buying another 512Mb/1Gb of RAM.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
