[R] Q: Suggestions for long-term data/program storage policy?

Duncan Murdoch murdoch at stats.uwo.ca
Tue Oct 11 22:42:39 CEST 2005


Berton Gunter wrote:
> A general comment. 
> 
> As usual, Brian is right on target. Indeed, this has been written,
> conferenced, agonized, kvetched,  etc. about extensively in the computer
> science community (and no doubt, among many others ... like accountants). I
> seem to remember reading a Scientific American Magazine article (or was it
> Science) about 10-15 years ago. As Brian says, it's not only application
> versions, applications, OS's -- but even hardware that goes obsolete. Do you
> have any data on 5 1/4" floppies from applications written for CP/M running
> on an Intel 8080? Think of poor banks, drug companies -- or the census
> bureau -- who have to keep their data forever. I sometimes wonder if all
> these bits and bytes will fill up all the earth's storage eventually? :-)
> 
> Anyway, you might try researching this in the CS literature to see what the
> strategy du jour is for this.

Now that journals are becoming electronic, librarians are also very
concerned with this problem, and they tend to have very long-term
storage goals.

Duncan Murdoch
> 
> Cheers,
> 
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>  
> "The business of the statistician is to catalyze the scientific learning
> process."  - George E. P. Box
>  
>  
> 
> 
>>-----Original Message-----
>>From: r-help-bounces at stat.math.ethz.ch
>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Prof Brian Ripley
>>Sent: Tuesday, October 11, 2005 4:05 AM
>>To: Alexander Ploner
>>Cc: r-help at stat.math.ethz.ch
>>Subject: Re: [R] Q: Suggestions for long-term data/program storage policy?
>>
>>On Tue, 11 Oct 2005, Alexander Ploner wrote:
>>
>>
>>>we are a statistical/epidemiological department that - after a few
>>>years of rapid growth - finally is getting around to formulating a
>>>general data storage and retention policy - mainly to ensure that we
>>>can reproduce results from published papers/theses more easily in the
>>>future, but also with the hope that we get more synergy between
>>>related projects.
>>>
>>>We have formulated what we feel is a reasonable draft, requiring
>>>basically that the raw data, all programs to create derived data
>>>sets, and the analysis programs are stored and documented in a
>>>uniform manner, regardless of the analysis software used. The minimum
>>>data retention we are aiming for is 10 years, and the format for the
>>>raw data is quite sane (either flat ASCII or real
>>
>>You are intending to retain copies of the OS used and hardware too?
>>The results depend far more on those than you apparently realize.
>>
>>
>>>Given the rapid development cycle of R,
>>
>>I think you will find your OS changes as fast: all those security updates
>>potentially affect your results.
>>
>>
>>>this suggests that at the very least all non-base packages used in the
>>>analysis are stored together with each project. I have basically two
>>>questions:
>>>
>>>1) Are old R versions (binaries/sources) going to be available on
>>>CRAN indefinitely?
>>
>>Not binaries.  The intention is that source files be available, but they
>>could become corrupted (as it seems the Windows binary has for a past
>>version).
>>
>>
>>>2) Is .RData a reasonable file format for long term storage?
>>
>>I would say not, as it is almost impossible to recover from any corruption
>>in such a file.  We like to have long-term data in a human-readable
>>printout, with a print copy, and also store some checksums.
>>
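The checksum step mentioned above could be sketched roughly as follows. This is only an illustration, not anything posted in the thread: it assumes Python, SHA-256 as the hash, and a manifest with one "digest  relative/path" line per file. The names sha256_of, write_manifest and MANIFEST.sha256 are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir, manifest_name="MANIFEST.sha256"):
    """Write one 'digest  relative/path' line per file under archive_dir.

    The manifest name and layout are assumptions made for this sketch.
    """
    root = Path(archive_dir)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if rel == manifest_name:
            continue  # do not checksum the manifest itself
        lines.append(f"{sha256_of(path)}  {rel}")
    manifest = root / manifest_name
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Because the manifest is plain text, it can be printed and filed alongside the human-readable copy of the data.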
>>
>>>I would also be very grateful for any other suggestions, comments or
>>>links for setting up and implementing such a storage policy (R-
>>>specific or otherwise).
>>
>>You need to consider the medium on which you are going to store the
>>archive.  We currently use CD-R (and not tapes as those are less
>>compatible across drives -- we have two identical drives currently but do
>>not expect either to last 10 years), and check them annually -- I guess we
>>will re-write to another medium after much less than 10 years.
>>
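An annual check of an archive copy, as described above, might look something like this sketch. It is an assumption-laden illustration rather than the procedure actually used: it presumes the archive carries a text manifest of SHA-256 digests, one "digest  relative/path" line per file (the layout the sha256sum tool emits), and the names verify_manifest and MANIFEST.sha256 are hypothetical.

```python
import hashlib
from pathlib import Path


def verify_manifest(archive_dir, manifest_name="MANIFEST.sha256"):
    """Return a list of (relative_path, problem) for files that fail the check.

    An empty list means every file named in the manifest is present and
    matches its recorded digest.
    """
    root = Path(archive_dir)
    problems = []
    text = (root / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file():
            problems.append((rel, "missing"))
            continue
        # Reads the whole file into memory; fine for a sketch, but large
        # files would be better hashed in chunks.
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            problems.append((rel, "checksum mismatch"))
    return problems
```

Any non-empty result would be the cue to restore the affected files from another copy before re-writing the archive to a new medium.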
>>-- 
>>Brian D. Ripley,                  ripley at stats.ox.ac.uk
>>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>>University of Oxford,             Tel:  +44 1865 272861 (self)
>>1 South Parks Road,                     +44 1865 272866 (PA)
>>Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide! 
>>http://www.R-project.org/posting-guide.html
>>
> 
> 
