[R] Dec. 1, 2009 tip of the day

John Christie jc at or.psychology.dal.ca
Tue Dec 1 05:23:23 CET 2009


RE: Compression

Hi R-Users,

You can deal with fairly large data sets in R on a relatively new computer.  One that I have been working with is a plain text file of nearly 100MB.  With storage as inexpensive as it is these days, that's not really all that much data, and I could store it just as it is.

Having said that, you may want to compress those data files.  There are two reasons for this.  One is that while storage is cheap, large files can be harder to move around; emailing a large data file, for example, is difficult.  The other is that, even though your drive has lots of capacity and seems pretty fast, it is in fact much, much slower than your CPU.  Recognizing this, Apple built transparent background compression into the file system in its latest operating system.  It speeds things up because the computer spends less time accessing the disk.  It spends more CPU time compressing and decompressing, but the CPU usually wasn't doing anything anyway and, remember, it's much faster than the disk.

It turns out R added transparent decompression for certain kinds of compressed files in the latest version (2.10).  If your files are compressed with bzip2, xz, or gzip, they can be read into R as if they were plain text files.  You should keep the proper filename extensions.

The command

myData <- read.table('myFile.gz')  # gzip-compressed files have a ".gz" extension

will work just as if 'myFile.gz' were the raw text file.
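
If you want to try this without a compressed file on hand, here is a minimal sketch that makes one first and then reads it back.  The data frame, the variable names, and the temporary file are all made up for illustration; only read.table(), write.table(), and gzfile() come from R itself.

```r
# Build a small example data frame (hypothetical data, for illustration only).
df <- data.frame(id = 1:5, score = c(2.5, 3.1, 4.8, 1.2, 3.3))

# Write it through a gzip connection so the example is self-contained.
gzPath <- tempfile(fileext = ".gz")
con <- gzfile(gzPath, open = "wt")
write.table(df, con, row.names = FALSE)
close(con)

# Read it back exactly as you would a plain text file.
myData <- read.table(gzPath, header = TRUE)
print(myData)
```

Nothing in the read.table() call mentions compression; the decompression happens behind the scenes.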

This is very handy for distributing your analysis scripts and data to co-workers together, and it saves space on your hard drive.  More importantly, it saves space on flash drives and in backups.

John

PS:  My 100MB data file compresses to 500KB.  That's tiny these days.  Programs that do the kinds of compression R accepts are available for free.  If you have a Mac or Linux computer they are built into your command line, and compressing a file is as simple as typing gzip 'filename'.

Bonus tip:  R has built-in facilities for writing out compressed files as well.  "?connections" gets you the help page with basic info.
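
As a rough sketch of the writing side, the following saves the same data as plain text and through a bzip2 connection, then compares the sizes on disk.  The data are invented and deliberately repetitive so the difference is easy to see; your own files will compress more or less depending on their content.  bzfile() is one of the connection functions described on the ?connections page; gzfile() and xzfile() are used the same way.

```r
# Hypothetical, highly repetitive data so the effect of compression is obvious.
df <- data.frame(group = rep(c("a", "b"), each = 5000),
                 value = rep(1:10, times = 1000))

plainPath <- tempfile(fileext = ".txt")
bzPath <- tempfile(fileext = ".bz2")

# Plain text version.
write.table(df, plainPath, row.names = FALSE)

# Compressed version, written through a bzip2 connection.
con <- bzfile(bzPath, open = "wt")
write.table(df, con, row.names = FALSE)
close(con)

# Compare the sizes on disk (in bytes).
plainSize <- file.info(plainPath)$size
bzSize <- file.info(bzPath)$size
print(c(plain = plainSize, bzip2 = bzSize))

# The compressed file reads back like any other text file.
roundTrip <- read.table(bzPath, header = TRUE)
```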



