[R] Hand-crafting an .RData file

Tue Nov 10 07:31:39 CET 2009

Thanks as always for a very helpful response. I'm now loading a few million
rows in only a few seconds.

Cordially,
Adam Kramer

On Mon, 9 Nov 2009, Prof Brian Ripley wrote:

> The R 'save' format (as used for the saved workspace .RData) is described in 
> the 'R Internals' manual (section 1.8).  It is intended for R objects, and 
> you would first have to create one[*] of those in your other application. 
> That seems a lot of work.
>
> The normal way to transfer numeric data between applications is to write a 
> binary file: R can read such files with readBin(), and it also has 
> wrappers/C-code to read a number of common binary data formats (e.g. those 
> from SPSS).
>
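A minimal sketch of the readBin() route described above. The file name, the vector length, the 8-byte double layout and the little-endian byte order are all assumptions made for illustration; the exporting application is stood in for by writeBin().

```r
# Hypothetical file and size, for illustration only.
path <- tempfile(fileext = ".bin")
n <- 1000L
x <- seq_len(n) / 10

# Stand-in for the exporting application: write n raw 8-byte doubles,
# little-endian, with no header.
con <- file(path, "wb")
writeBin(x, con, size = 8L, endian = "little")
close(con)

# Read them back. 'n' is the maximum number of items to read, so the
# reader has to know (or over-estimate) the length in advance.
con <- file(path, "rb")
y <- readBin(con, what = "double", n = n, size = 8L, endian = "little")
close(con)
```

A real exporter would have to agree with R on item size and byte order, and a multi-column table would probably be written as one such vector per column.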
> With character data there are more issues (and more formats, see also 
> readChar()), but load() is not particularly fast for those.
>
> Ultimately the R functions pay a performance price for their flexibility so 
> hand-crafted C code to read the format can be worthwhile: but see the 
> comments below about whether I/O speed is that important.
>
> [*] the 'save' format is a serialization of a single R object, even if you 
> save many objects, since the object(s) are combined into a pairlist.
>
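To illustrate the footnote from the R side only, here is a small sketch that saves two objects into one file and restores them into a fresh environment. The object names and the temporary path are made up for the example, and nothing here inspects the on-disk layout itself.

```r
# Two hypothetical objects to save together.
a <- 1:3
b <- c("x", "y")

path <- tempfile(fileext = ".RData")
save(a, b, file = path)

# load() restores every object in the file and returns their names.
env <- new.env()
restored <- load(path, envir = env)
```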
> On Sun, 8 Nov 2009, Adam D. I. Kramer wrote:
>
>> Hello,
>>
>> 	I frequently have to export a large quantity of data from some
>> source (for example, a database, or a hand-written perl script) and then
>> read it into R.  This occasionally takes a lot of time; I'm usually using
>> read.table("filename",comment.char="",quote="") to read the data once it is
>> written to disk.
>
> Specifying colClasses and nrows will usually help.
>
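A minimal sketch of that advice, assuming a hypothetical three-column tab-delimited file with no header. The column types and the row count are assumptions that would have to match the real export.

```r
# Hypothetical sample file standing in for the exported data.
path <- tempfile(fileext = ".tsv")
writeLines(c("1\t0.5\ta", "2\t1.5\tb", "3\t2.5\tc"), path)

# Declaring the column types up front avoids type guessing, and nrows
# lets read.table() size its result in advance.
dat <- read.table(path, sep = "\t", comment.char = "", quote = "",
                  colClasses = c("integer", "numeric", "character"),
                  nrows = 3)
```

With no header, the columns get the default names V1, V2 and V3.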
> To read from a database, packages such as RODBC use binary data transfer: 
> with suitable tuning this can be fast.
>
>> 	However, I *know* that the program that generates the data is more
>> or less just calling printf in a for loop to create the csv or
>> tab-delimited file, writing, then having R parse it, which is pretty
>> inefficient.  Instead, I am interested in figuring out how to write the
>> data in .RData format so that I can load() it instead of read.table() it.
>
> Without more details it is hard to say if it is inefficient. read.table() can 
> read data pretty fast (millions of items per second) if used following the 
> hints in the 'R Data Import/Export' manual.  See e.g.
> https://stat.ethz.ch/pipermail/r-devel/2004-December/031733.html
>
> Almost anything non-trivial one might do with such data is much slower.  The 
> trend is to write richer (and slower to read) data formats.
>
>> 	Trolling the internet, however, has not suggested anything about the
>> specification for an .RData file. Could somebody link me to a specification
>> or some information that would instruct me on how to construct a .RData
>> file (either compressed or uncompressed)?
>>
>> 	Also, I am open to other suggestions of how to get load()-like
>> efficiency in some other way.
>> 
>> Many thanks,
>> Adam D. I. Kramer
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595