[Rd] Request to speed up save()
toth.denes at ttk.mta.hu
Thu Jan 15 19:06:48 CET 2015
On 01/15/2015 01:45 PM, Stewart Morris wrote:
> I am dealing with very large datasets and it takes a long time to save a
> workspace image.
> The options to save compressed data are: "gzip", "bzip2" or "xz", the
> default being gzip. I wonder if it's possible to include the pbzip2
> (http://compression.ca/pbzip2/) algorithm as an option when saving.
> "PBZIP2 is a parallel implementation of the bzip2 block-sorting file
> compressor that uses pthreads and achieves near-linear speedup on SMP
> machines. The output of this version is fully compatible with bzip2
> v1.0.2 or newer"
> I tested this as follows with one of my smaller datasets, having only
> read in the raw data:
> # Dumped an ascii image
> save.image(file='test', ascii=TRUE)
> # At the shell prompt:
> ls -l test
> -rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test
> time bzip2 -9 test
> 364.702u 3.148s 6:14.01 98.3% 0+0k 48+1273976io 1pf+0w
> time pbzip2 -9 test
> 422.080u 18.708s 0:11.49 3836.2% 0+0k 0+1274176io 0pf+0w
> As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took
> 11 seconds, admittedly on a 64-core machine (running at 50% load). Most
> modern machines are multicore, so everyone would get some speedup.
> Is this feasible/practical? I am not a developer so I'm afraid this
> would be down to someone else...
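In the meantime, a workaround is possible without any change to R itself:
save() and load() both accept connections, so the serialized image can be
piped through an external parallel compressor. A rough sketch (assuming
pbzip2 is installed and on the PATH; the file name is just a placeholder):

con <- pipe("pbzip2 -c -9 > workspace.RData.bz2", open = "wb")
save(list = ls(all.names = TRUE), envir = .GlobalEnv, file = con)
close(con)

# Reading it back through a parallel decompression pipe:
con <- pipe("pbzip2 -dc workspace.RData.bz2", open = "rb")
load(con, envir = .GlobalEnv)
close(con)

Since pbzip2's output is bzip2-compatible (as the documentation quoted
above notes), the resulting file can also be handled by ordinary bzip2
tools.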
Take a look at the gdsfmt package. It supports the superfast LZ4
compression algorithm and provides highly optimized functions for writing
to and reading from disk.
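For illustration, a minimal sketch of storing and reading back a large
matrix with gdsfmt (the compression code "LZ4_RA" and the exact arguments
are assumptions; see ?add.gdsn for the options your version supports):

library(gdsfmt)

# Create a GDS file and store a matrix with LZ4 compression
f <- createfn.gds("test.gds")
add.gdsn(f, "mat", val = matrix(rnorm(1e6), nrow = 1000),
         compress = "LZ4_RA", closezip = TRUE)
closefn.gds(f)

# Read it back
f <- openfn.gds("test.gds")
m <- read.gdsn(index.gdsn(f, "mat"))
closefn.gds(f)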