[R] How to reach the column names in a huge .RData file without loading it

Thu Mar 17 08:22:29 CET 2016

Jan T Kim <jttkim at googlemail.com> writes:

> On Wed, Mar 16, 2016 at 03:18:27PM -0400, Duncan Murdoch wrote:
>> On 16/03/2016 1:40 PM, Jan Kim wrote:
>> >Barry: that's an interesting hack.
>> >
>> >I do feel compelled to make two comments, though, regarding the
>> >general issue rather than the scraping idea:
>> >
>> >(1) If your situation is that the image (.RData file) is the only
>> >copy of the data, you'll need to rescue the data from that as soon as
>> >possible anyway. Something like
>> >
>> >     load(".RData");
>> >     write.csv(mydataframe, file = "mydata.csv");
>> >
>> >should do the trick. It will be slow, but you'll need to do it just
>> >once, so you might as well enjoy your coffee while you wait. From that
>> >point on, work with the mydata.csv file for getting at the colnames
>> >(and anything else as well).
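
Once such a CSV file exists, the column names can be read without parsing the whole file. A minimal sketch, assuming a small stand-in data frame (the name mydataframe is taken from the example above; its contents and the temporary file path are made up for illustration):

```r
# Hypothetical stand-in for the data frame rescued from the .RData image.
mydataframe <- data.frame(id = 1:3, name = c("a", "b", "c"))

csv_file <- tempfile(fileext = ".csv")
# row.names = FALSE avoids an extra unnamed column in the header line.
write.csv(mydataframe, file = csv_file, row.names = FALSE)

# Read the header plus a single data row; the rest of the file is not parsed.
header_only <- read.csv(csv_file, nrows = 1)
column_names <- names(header_only)
print(column_names)
```

With the default row.names = TRUE, write.csv() adds a leading column with an empty header, which read.csv() would then report as an extra column name.
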
>> >
>> >(2) If there's any chance / risk that scraping data off images is not
>> >a one-off, the time to prevent that from catching on is now. If data is
>> >of any value at all, it should be handled in a sane, portable, textual
>> >format. For tabular data, csv is normally adequate or at least good
>> >enough, but .RData images are never a good idea.
>> 
>> I agree with the sentiment, but not with the choice of .csv as a
>> "sane, portable, textual format".  CSV has no type information
>> included, so strings that contain only digits can turn into numbers
>> (and get rounded in the process), things that look like
>> dates can get converted to different formats, etc.
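
The loss of type information can be reproduced with a small round trip. This is only a sketch with made-up data (the column name zip and its values are assumptions for illustration); it also shows that declaring the type on reading, via colClasses, is one way to work around the problem:

```r
# Made-up example: strings that consist only of digits, with leading zeros.
original <- data.frame(zip = c("00123", "04567"), stringsAsFactors = FALSE)

csv_file <- tempfile(fileext = ".csv")
write.csv(original, file = csv_file, row.names = FALSE)

# Without type information, read.csv() guesses: the column is read back
# as integer here, so the leading zeros are lost.
guessed <- read.csv(csv_file)
print(class(guessed$zip))

# Declaring the column type explicitly keeps the strings intact.
declared <- read.csv(csv_file, colClasses = c(zip = "character"))
print(declared$zip)
```

The catch is that the type declaration lives in the reading code, not in the CSV file, so it has to be stored and documented alongside the data.
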
>
> I entirely agree. In hindsight, I should have stated that the .RData files,
> as well as the R code to load and extract stuff from them, should be stored
> permanently and documented.
>
>> The .RData format has the disadvantage of being hard to use outside
>> R, but at least it is usable in R.
>
> yes -- that's why I thought it's a good idea to use R to pluck out the
> valuable data, so (1) they can still be accessed even if the .RData
> format changes and (2) they're in their own file, separated from the
> (potentially humongous, see my P.S.) amount of other stuff caught up
> in the image.
>
> But to reiterate, the .RData file should be secured as well if that's
> the only remaining primary / original source of the data.
>
>> I don't know what I'd recommend if I wanted a portable textual
>> format.  JSON is close, but it can't handle the full
>> range of data that R can handle (e.g. no Inf).  dput() on a
>> dataframe is text, but nothing but R can read it.
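
For what it's worth, the dput() route can be sketched as a round trip through a text file with dget(). The data frame below is made up for illustration; it includes Inf, which is the kind of value mentioned above as a problem for JSON:

```r
# Made-up data frame with a character column and infinite values.
df <- data.frame(
  label = c("a", "b", "c"),
  value = c(1.5, Inf, -Inf),
  stringsAsFactors = FALSE
)

txt_file <- tempfile(fileext = ".R")
# dput() writes an R expression that recreates the object as plain text.
dput(df, file = txt_file)

# dget() evaluates that text again, so only R can read it back.
restored <- dget(txt_file)
print(all.equal(df, restored))
```

Note that dput() with its default settings may not reproduce every double exactly; the control argument (for example "digits17") is worth a look if exact values matter.
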
>
> yes, that's the problem with "JSON", it's a JavaScript but not really
> an object notation, as it doesn't store class structure metadata.
>
> So again, the best bet is to secure multiple levels: the .RData
> image to preserve the R types, the R script to be able to identify
> the relevant variable(s), and the text version to avoid depending on
> availability of R / an R version still able to read the image format.
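
Such an extraction script might look roughly like the following. This is only a sketch: the object names, the temporary paths and the choice to export every data frame are assumptions for illustration, and it still loads the whole image, so it does not address the original question of avoiding the load.

```r
# Hypothetical image containing one data frame and some unrelated clutter.
image_file <- tempfile(fileext = ".RData")
mydataframe <- data.frame(x = 1:2, y = c("u", "v"))
clutter <- rnorm(10)
save(mydataframe, clutter, file = image_file)

# Load into a separate environment so the current workspace is not overwritten.
img <- new.env()
load(image_file, envir = img)

# Identify which of the loaded objects are data frames.
is_df <- vapply(ls(img), function(nm) is.data.frame(get(nm, envir = img)),
                logical(1))
df_names <- names(is_df)[is_df]

# Write each data frame to its own CSV file in a fresh output directory.
out_dir <- tempfile("export")
dir.create(out_dir)
for (nm in df_names) {
  write.csv(get(nm, envir = img),
            file = file.path(out_dir, paste0(nm, ".csv")),
            row.names = FALSE)
}
print(list.files(out_dir))
```

Keeping the script next to the image and the CSV files documents which variables were considered relevant.
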
>
> Best regards, Jan

The package 'h5' provides an R interface to HDF5 files.  I have used
neither, but am aware that HDF5 is a widely used format for storing
complex data structures.  Would that be useful?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de