[Rd] Request to speed up save()

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Jan 15 18:54:58 CET 2015


On 15/01/2015 12:45, Stewart Morris wrote:
> Hi,
>
> I am dealing with very large datasets and it takes a long time to save a
> workspace image.

Sounds like bad practice on your part ... saving images is not 
recommended for careful work.

> The options to save compressed data are: "gzip", "bzip2" or "xz", the
> default being gzip. I wonder if it's possible to include the pbzip2
> (http://compression.ca/pbzip2/) algorithm as an option when saving.

It is not an 'algorithm': it is a command-line utility, widely available 
for Linux at least.

> "PBZIP2 is a parallel implementation of the bzip2 block-sorting file
> compressor that uses pthreads and achieves near-linear speedup on SMP
> machines. The output of this version is fully compatible with bzip2
> v1.0.2 or newer"
>
> I tested this as follows with one of my smaller datasets, having only
> read in the raw data:
>
> ============
> # Dumped an ascii image
> save.image(file='test', ascii=TRUE)

Why do that if you are at all interested in speed?  It requires a 
pointless (and inaccurate) binary-to-decimal conversion.
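If speed is the concern, the difference is easy to measure.  A minimal 
sketch (the object is just an illustrative placeholder; timings will 
depend on the machine and the data):

    x <- matrix(rnorm(1e7), ncol = 100)    # ~80 MB of doubles, illustrative only
    system.time(save(x, file = "bin.RData"))                # binary, gzip by default
    system.time(save(x, file = "asc.RData", ascii = TRUE))  # decimal text, typically far slower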

>
> # At the shell prompt:
> ls -l test
> -rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test
>
> time bzip2 -9 test
> 364.702u 3.148s 6:14.01 98.3%    0+0k 48+1273976io 1pf+0w
>
> time pbzip2 -9 test
> 422.080u 18.708s 0:11.49 3836.2%    0+0k 0+1274176io 0pf+0w
> ============
>
> As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took
> 11 seconds, admittedly on a 64 core machine (running at 50% load). Most
> modern machines are multicore so everyone would get some speedup.

But R does not by default save bzip2-compressed ASCII images ... and gzip 
is the default because its speed/compression trade-offs (see ?save) suit 
the typical R user best.
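
The documented alternatives are already selectable per call; a minimal 
sketch (the object and file names are arbitrary):

    obj <- rnorm(1e6)                                    # placeholder object
    save(obj, file = "t_gz.RData")                       # default: gzip
    save(obj, file = "t_bz.RData", compress = "bzip2")   # usually smaller, slower to write
    save(obj, file = "t_xz.RData", compress = "xz")      # usually smallest, slowest to write

load() detects which compression was used, so the reader of the file need 
not know which was chosen.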

And your last point rests on a common misunderstanding: that people 
typically have lots of spare cores that are free to use.  Even on my 
desktop with 8 (virtual) cores, where I usually do have spare cores, 
using them has a price in lost turbo headroom and cache contention. 
Quite a large price: an R session may run 1.5-2x slower when 7 other 
tasks are run in parallel.

> Is this feasible/practical? I am not a developer so I'm afraid this
> would be down to someone else...

Not in base R.  For one thing, it would need a linkable library, which 
the site you quote does not obviously provide.

Nothing is stopping you writing a sensible uncompressed image and 
optionally compressing it externally, but note that on some file systems 
compressed saves are faster because of the reduced I/O.
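
For example (an untested sketch, assuming pbzip2 is on the PATH):

    save.image(file = "img.RData", compress = FALSE)   # plain uncompressed image
    system("pbzip2 -9 img.RData")                      # replaces it with img.RData.bz2

    # later: decompress first, then load -- the safe route
    system("pbzip2 -d img.RData.bz2")
    load("img.RData")

load() can read bzip2-compressed saves directly, but I have not checked 
whether it copes with pbzip2's multi-stream output, so decompressing 
first is the cautious choice.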

> Thoughts?

>
> Cheers,
>
> Stewart
>


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK


