[Rd] Request to speed up save()

Nathan Kurz nate at verse.com
Thu Jan 15 23:18:13 CET 2015


On Thu, Jan 15, 2015 at 11:08 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
> In addition to the major points that others made: if you care about speed, don't use compression. With today's fast disks it's an order of magnitude slower to use compression:
>
>> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
>> system.time(saveRDS(d, file="test.rds.gz"))
>    user  system elapsed
>  17.210   0.148  17.397
>> system.time(saveRDS(d, file="test.rds", compress=F))
>    user  system elapsed
>   0.482   0.355   0.929
>
> The above example is intentionally well compressible, in real life the differences are actually even bigger. As people that deal with big data know well, disks are no longer the bottleneck - it's the CPU now.

Respectfully, while your example would imply this, I don't think this
is correct in the general case.  Much faster compression schemes
exist, and using them can improve disk I/O tremendously.  Some
schemes are so fast that it's quicker to transfer compressed data
from main RAM to CPU cache and decompress it there than to be
limited by RAM bandwidth: https://github.com/Blosc/c-blosc

Repeating that for emphasis: compressing and uncompressing can
actually be faster than a straight memcpy()!

Really, the issue is that 'gzip' and 'bzip2' are the bottlenecks.  As
Stewart suggests, this can be mitigated by throwing more cores at the
problem.  That isn't a bad solution, since there are often idle cores
available.  But it would be better to choose a faster compression
scheme first, and then parallelize it across cores if that is still
necessary.
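
For what it's worth, you can get much of this today without touching
save() itself by streaming the uncompressed serialization through an
external compressor with pipe().  A rough sketch, assuming a Unix-ish
system with pigz (or lz4) on the PATH:

# write: saveRDS() leaves the stream uncompressed when given a
# connection, and pigz does parallel gzip on the way out
con <- pipe("pigz --fast > test.rds.gz", "wb")
saveRDS(d, con)
close(con)

# read it back: pigz decompresses to stdout, readRDS() consumes the stream
con <- pipe("pigz -dc test.rds.gz", "rb")
d2 <- readRDS(con)
close(con)

Since a connection bypasses saveRDS()'s internal gzip, the only
compression cost is whatever the external tool charges.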

Sometimes the tradeoff is between compression ratio and speed, and
sometimes one algorithm is simply faster than another.  Here are some
sample timings for the test file that your example creates:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.554   0.336   0.890

nate at ubuntu:~/R/rds$ ls -hs test.rds
382M test.rds
nate at ubuntu:~/R/rds$ time gzip -c test.rds > test.rds.gz
real: 16.207 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.330 sec

nate at ubuntu:~/R/rds$ time gzip -c --fast test.rds > test.rds.gz
real: 4.759 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
56M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.942 sec

nate at ubuntu:~/R/rds$ time pigz -c  test.rds > test.rds.gz
real: 2.180 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.375 sec

nate at ubuntu:~/R/rds$ time pigz -c --fast test.rds > test.rds.gz
real: 0.739 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
57M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.851 sec

nate at ubuntu:~/R/rds$ time lz4c test.rds > test.rds.lz4
Compressed 400000102 bytes into 125584749 bytes ==> 31.40%
real: 1.024 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.lz4
120M test.rds.lz4
nate at ubuntu:~/R/rds$ time lz4 test.rds.lz4 > discard
Compressed 125584749 bytes into 95430573 bytes ==> 75.99%
real: 0.775 sec

Reading that last one more closely: with single-threaded lz4
compression, we're getting 3x compression at about 400 MB/s, and
decompression at about 500 MB/s.  That is faster than almost any
single disk can sustain.  A multithreaded implementation would make
even the fastest RAID the bottleneck.
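
For reference, those rates fall straight out of the sizes and times
above, treating the second lz4 timing as the decompression pass:

400000102 / 1.024 / 2^20   # ~373 MB/s into the compressor
400000102 / 0.775 / 2^20   # ~492 MB/s back out on decompression
125584749 / 400000102      # ~0.31, i.e. roughly 3x compression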

It's probably worth noting that the speeds reported in your simple
example for the uncompressed case are likely the speed of writing to
the OS page cache in memory, with the actual write to disk happening
some time later.  Sustained throughput to disk will likely be slower
than your example implies.
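
A crude way to include the flush in the measurement, assuming a Linux
box where 'sync' is available:

# without the sync, system.time() mostly measures the write into the
# page cache rather than the write to the device
system.time({
  saveRDS(d, file = "test.rds", compress = FALSE)
  system("sync")
})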

If saving data to disk is a bottleneck, I think Stewart is right that
there is a lot of room for improvement.

--nate
