[Rd] Request to speed up save()

Stewart Morris Stewart.Morris at igmm.ed.ac.uk
Thu Jan 15 13:45:50 CET 2015


I am dealing with very large datasets and it takes a long time to save a 
workspace image.

The options for saving compressed data are "gzip", "bzip2" or "xz", the 
default being gzip. I wonder whether it would be possible to offer 
pbzip2 (http://compression.ca/pbzip2/) as an additional option when saving.

"PBZIP2 is a parallel implementation of the bzip2 block-sorting file 
compressor that uses pthreads and achieves near-linear speedup on SMP 
machines. The output of this version is fully compatible with bzip2 
v1.0.2 or newer"
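In the meantime, since save() and load() accept an arbitrary connection 
for their file argument (and compression is only applied by save() when 
given a file name, not a connection), one can already stream an image 
through an external pbzip2 process with pipe(). A sketch, assuming 
pbzip2 is on the PATH; the file name "image.RData.bz2" is just an example:

```r
## Write the workspace image uncompressed into a pipe to pbzip2;
## pbzip2's output is bzip2-compatible, so the result is an
## ordinary .bz2 file.
con <- pipe("pbzip2 -c -9 > image.RData.bz2", "wb")
save.image(file = con)
close(con)

## Restore it the same way, decompressing through pbzip2:
con <- pipe("pbzip2 -dc image.RData.bz2", "rb")
load(con)
close(con)
```

This avoids the single-threaded compressor inside save() itself, though 
a built-in option would of course be cleaner.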

I tested this as follows with one of my smaller datasets, having only 
read in the raw data:

# Dumped an ASCII image
save.image(file='test', ascii=TRUE)

# At the shell prompt:
ls -l test
-rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test

time bzip2 -9 test
364.702u 3.148s 6:14.01 98.3%	0+0k 48+1273976io 1pf+0w

time pbzip2 -9 test
422.080u 18.708s 0:11.49 3836.2%	0+0k 0+1274176io 0pf+0w

As you can see, bzip2 on its own took over 6 minutes of wall-clock time, 
whereas pbzip2 took about 11 seconds, admittedly on a 64-core machine 
(running at roughly 50% load). Most modern machines are multicore, so 
most users would see some speedup.

Is this feasible/practical? I am not a developer, so I'm afraid this 
would be down to someone else...


Stewart W. Morris
Centre for Genomic and Experimental Medicine
The University of Edinburgh
United Kingdom

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
