[Rd] warning for inefficiently compressed datasets

Uwe Ligges ligges at statistik.tu-dortmund.de
Wed Dec 7 09:34:08 CET 2011



On 06.12.2011 23:28, Hervé Pagès wrote:
> Hi,
>
> Recently added to doc/NEWS.Rd:
>
> 'R CMD check' now gives a warning rather than a note if it finds
> inefficiently compressed datasets. With 'bzip2' and 'xz' compression
> having been available since R 2.10.0, there is no excuse for not
> using them.
>
> Why isn't a note enough for this?
>
> Generally speaking, warnings are for things that are dangerous,
> or unsafe, or unportable, or for anything that could potentially
> cause trouble. I don't see how using gzip instead of bzip2 or xz
> could fall into that category (and BTW gzip is the default for
> save() and for 'R CMD build' resave-data feature).
>
> The problem is that bzip2 and xz compressions are slower and also
> require more memory than gzip. Bioconductor has big data packages
> and sometimes it makes sense to use gzip and not bzip2 or xz. For
> example, when loading Human chromosome 1 from disk, bzip2 and xz
> are 7 and 3.4 times slower than gzip, respectively:
>
>  > system.time(load("chr1-gzip.rda"))
>    user  system elapsed
>   1.210   0.180   1.384
>
>  > system.time(load("chr1-bzip2.rda"))
>    user  system elapsed
>   9.500   0.160   9.674
>
>  > system.time(load("chr1-xz.rda"))
>    user  system elapsed
>    4.46    0.20    4.69
>
> hpages at latitude:~/testing$ ls -lhtr chr1-*.rda
> -rw-r--r-- 1 hpages hpages 61M 2011-12-06 12:13 chr1-gzip.rda
> -rw-r--r-- 1 hpages hpages 55M 2011-12-06 12:15 chr1-bzip2.rda
> -rw-r--r-- 1 hpages hpages 49M 2011-12-06 12:25 chr1-xz.rda
>
> This is with R-2.14.0 on a 64-bit Ubuntu laptop with 8GB of RAM.
>
> The size on disk doesn't really matter and it doesn't matter either
> that the source tarball for the full Human genome ends up being 20%
> bigger when using gzip instead of xz: the 20% extra time it takes to
> download it (which needs to be done only once) will largely be
> compensated by the fact that most analyses will run faster e.g. in
> 40-45 sec. instead of more than 2 minutes (for many short analyses,
> loading the chromosomes into memory is the bottleneck).
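
A minimal sketch for repeating the size/load-time comparison quoted above on other data. The object 'x' is a made-up stand-in (random DNA letters), not the chr1 object from the post, and the file names are hypothetical; only save(), load(), file.info() and system.time() from base R are used. Numbers will differ by machine and by how compressible the data are.

```r
# Hypothetical stand-in data: one million random DNA letters.
# This is NOT the chr1 object from the thread.
set.seed(1)
x <- sample(c("A", "C", "G", "T"), 1e6, replace = TRUE)

methods <- c("gzip", "bzip2", "xz")
files <- file.path(tempdir(), paste0("x-", methods, ".rda"))
names(files) <- methods

# Save the same object once per compression method.
for (m in methods) save(x, file = files[[m]], compress = m)

# Size on disk for each file.
sizes <- file.info(files)$size

# Elapsed time to load each file into a throwaway environment.
load_sec <- sapply(files, function(f) {
  system.time(load(f, envir = new.env()))[["elapsed"]]
})

print(data.frame(method = methods, bytes = sizes, load_sec = load_sec))
```

For small objects the load times are dominated by noise, so repeating the timing several times (or using a larger object) gives a fairer picture.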


Oh, from the European side this 20% extra time may be an hour when
downloading from the BioC master rather than from a mirror.
And space and traffic are issues for CRAN.



> Is there a way to turn this warning off? If not, could an option be
> added to 'R CMD check' to turn this warning off? Something along the
> lines of the --no-resave-data option for 'R CMD build'.


The manual tells us:

"The following environment variables can be used to customize the 
operation of check: a convenient place to set these is the file 
‘~/.R/check.Renviron’.

[...]

_R_CHECK_COMPACT_DATA2_

If true, check data for ascii and uncompressed saves, and also check if 
using bzip2 or xz compression would be significantly better. Implies 
_R_CHECK_COMPACT_DATA_ is true. Default: true."
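
A minimal sketch of setting that variable, assuming that "false" is accepted as the non-true value and that skipping the bzip2/xz comparison is what is wanted (the check for ascii and uncompressed saves is governed by _R_CHECK_COMPACT_DATA_ and is untouched here). The file in tempdir() is a stand-in for ‘~/.R/check.Renviron’, so that the example does not modify anything in the home directory.

```r
# Line to put in the check environment file.
# Assumption: "false" is read as a false value by 'R CMD check'.
setting <- "_R_CHECK_COMPACT_DATA2_=false"

# Stand-in path for illustration; the real file would be ~/.R/check.Renviron.
renviron <- file.path(tempdir(), "check.Renviron")
cat(setting, "\n", sep = "", file = renviron, append = TRUE)

# Read the file into the current session to confirm that it parses.
readRenviron(renviron)
Sys.getenv("_R_CHECK_COMPACT_DATA2_")
```

The same variable can also be set in the shell environment from which 'R CMD check' is started.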


Uwe



>
> Thanks,
> H.
>


