[Rd] Request to speed up save()

Simon Urbanek simon.urbanek at r-project.org
Thu Jan 15 20:08:58 CET 2015


In addition to the major points others have made: if you care about speed, don't use compression. With today's fast disks, compression makes saving an order of magnitude slower:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds.gz"))
   user  system elapsed 
 17.210   0.148  17.397 
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed 
  0.482   0.355   0.929 

The above example is intentionally highly compressible; in real life the difference is even bigger. As people who deal with big data know well, the disk is no longer the bottleneck - the CPU is.
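You can take the disk out of the picture entirely and measure the pure CPU cost of serialization by serializing to a raw vector in memory (a rough sketch, using the same d as above; exact timings will of course vary by machine):

> system.time(r <- serialize(d, NULL))   # in-memory, no compression, no I/O
> length(r) / 2^20                       # serialized size in MB

serialize() with connection=NULL returns the raw bytes directly, so whatever time this takes is serialization alone.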

Cheers,
Simon

BTW: why in the world would you use ascii=TRUE? It's pretty much the slowest serialization you can choose - it dwarfs even the cost of compression:

> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed 
  0.459   0.383   0.940 
> system.time(saveRDS(d, file="test-a.rds", compress=F, ascii=T))
   user  system elapsed 
 36.713   0.140  36.929 

and the same goes for reading:

> system.time(readRDS("test-a.rds"))
   user  system elapsed 
 27.616   0.275  27.948 
> system.time(readRDS("test.rds"))
   user  system elapsed 
  0.609   0.184   0.795 
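As for the pbzip2 request quoted below: you don't need any changes to R to try it, because save()/saveRDS() accept arbitrary connections, so you can pipe through an external compressor yourself. A rough sketch, assuming pbzip2 is on your PATH:

> con <- pipe("pbzip2 -c -9 > test.rds.bz2", "wb")
> saveRDS(d, con)
> close(con)

And since pbzip2 output is bzip2-compatible, reading back is just

> d2 <- readRDS(bzfile("test.rds.bz2"))

(or another pipe("pbzip2 -dc test.rds.bz2", "rb") if you want parallel decompression as well).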



> On Jan 15, 2015, at 7:45 AM, Stewart Morris <Stewart.Morris at igmm.ed.ac.uk> wrote:
> 
> Hi,
> 
> I am dealing with very large datasets and it takes a long time to save a workspace image.
> 
> The options to save compressed data are: "gzip", "bzip2" or "xz", the default being gzip. I wonder if it's possible to include the pbzip2 (http://compression.ca/pbzip2/) algorithm as an option when saving.
> 
> "PBZIP2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2 or newer"
> 
> I tested this as follows with one of my smaller datasets, having only read in the raw data:
> 
> ============
> # Dumped an ascii image
> save.image(file='test', ascii=TRUE)
> 
> # At the shell prompt:
> ls -l test
> -rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test
> 
> time bzip2 -9 test
> 364.702u 3.148s 6:14.01 98.3%	0+0k 48+1273976io 1pf+0w
> 
> time pbzip2 -9 test
> 422.080u 18.708s 0:11.49 3836.2%	0+0k 0+1274176io 0pf+0w
> ============
> 
> As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took 11 seconds, admittedly on a 64 core machine (running at 50% load). Most modern machines are multicore so everyone would get some speedup.
> 
> Is this feasible/practical? I am not a developer so I'm afraid this would be down to someone else...
> 
> Thoughts?
> 
> Cheers,
> 
> Stewart
> 
> -- 
> Stewart W. Morris
> Centre for Genomic and Experimental Medicine
> The University of Edinburgh
> United Kingdom
> 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 


