[R] Version Controlled CRAN Packages
marc_schwartz at me.com
Thu Jan 3 17:38:01 CET 2013
On Jan 3, 2013, at 8:33 AM, Mario Bourgoin <mob at media.mit.edu> wrote:
> Dear Sir or Madam,
> The group of people with whom I work is now convinced of the usefulness of
> using R and its packages to meet our needs for statistical analysis. It
> has become important that R programs and scripts we create today can be run
> by someone else tomorrow, so we need to use version control. For this to work
> well, we need to version-control not just our code, but also R and the CRAN
> packages we use. (We only use CRAN for now.) Fortunately, R is under
> Subversion, and many CRAN packages are under Subversion in R-Forge.
> However, many CRAN packages do not appear to be available from R-Forge.
> 1- Are all CRAN packages available from some repository under version
> control? (My guess is ``no.'')
> 2- Is there an identifier on CRAN that flags a package as under version
> control in a repository? (My guess is ``no.'')
> 3- How does CRAN do version control for non-repository packages? (My guess
> is ``through the generosity of volunteer administrators'' though I would
> prefer that some version control software be involved.)
> 4- Should we decide to create a local source repository to meet our needs?
> (My guess is ``that depends.'')
> 5- Where might I find examples of groups creating and maintaining local
> source repositories for R and its packages?
> Mario Bourgoin
I suspect that you will get various responses, so let me offer my ten cents:
1. Old versions of CRAN packages are typically, though possibly not always, available via an "Old Sources" link on each package's page on CRAN. You could use that approach to obtain old source versions of packages. However, locally compiling and using the archived source version of a package (e.g. where you previously used a precompiled binary on OS X, Windows or even Linux in some cases) could yield behavioral changes over time. Hardware, OS, compiler and other environmental changes (bugs, 32- versus 64-bit, differing compiler options, etc.) could introduce even subtle problems that may prevent you from exactly replicating results from previous work. These are especially important to consider for CRAN packages that are not "pure R" (e.g. they include C, C++, FORTRAN, etc.).
2. The old versions of contributed CRAN packages that are physically on CRAN are not under a true file level source version control system there. It is up to each package maintainer/author to elect to use such a tool themselves outside of CRAN. R-Forge and GitHub are perhaps the two most popular online platforms, but others may be used and yet others may use local offline repos that you do not have access to. Some may not use a true version control system at all. There is no requirement for or any enforcement of a particular development process for contributed CRAN packages.
3. R itself is under SVN control, but that is unlikely to help you unless you compile R from source and keep track of SVN revision numbers. If you typically install precompiled binary versions of R, you will want to archive the OS-specific R binaries that you use.
4. As noted above, running code today versus running that same code five years from now, using the same versions of R and CRAN packages that you used today, can be problematic. It is not only R and the CRAN packages that are changing, but also your hardware, OS, compilers and possibly other relevant tools, all of which are highly likely to change as well. All of these factors can contribute to your ability or inability to exactly replicate results over time. Only you can determine just how much of today's R/CRAN installation and computing environment you need to be able to replicate in the future.
5. If you have datasets that you will be using and need to replicate the same results five years from now on the same dataset that you used today, you will need to maintain your datasets (not just your code) in a version control system as well.
6. You might also want to look into "Reproducible Research".
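To make point 1 above concrete, here is a minimal sketch of locating an archived source tarball. It assumes that archived sources live under src/contrib/Archive/<package>/ on the CRAN mirror, which is where the "Old Sources" links point. The helper name cran_archive_url and the package name and version shown are hypothetical placeholders, not anything CRAN provides.

```r
# Build the URL of an archived source tarball on a CRAN mirror.
# Assumed layout: <mirror>/src/contrib/Archive/<pkg>/<pkg>_<version>.tar.gz
# (cran_archive_url is a hypothetical helper, not part of R or CRAN.)
cran_archive_url <- function(pkg, version,
                             mirror = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz",
          mirror, pkg, pkg, version)
}

# "examplepkg" and "1.0-2" are placeholders for a real package and version.
url <- cran_archive_url("examplepkg", "1.0-2")
print(url)

# Installing from that tarball needs network access and a working build
# toolchain, so it is left commented out here:
# install.packages(url, repos = NULL, type = "source")
```

Keep in mind the caveat in point 1: a source install compiled today may not behave identically to the binary you used originally.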
Bottom line, you have defined or are in the process of defining your own local requirements and perhaps SOPs. Thus, take control of your own risk mitigation process. Implement your own version control system locally, one that includes, if you use them, the precompiled binaries of R and of any CRAN packages that you use, so that you can replicate the state of an R installation to your own requirements, notwithstanding the hardware and OS level changes that will occur.
You will of course want to document the version of R and any third party packages that you use when performing an analysis, so that you can track such information for future use.
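One possible way to capture that information is sketched below. It writes the R version, platform and the version of every installed package to a plain text file that can be committed alongside the analysis. The function name record_environment and the output file name are illustrative assumptions; sessionInfo() is a lighter alternative if you only need the packages loaded in the current session.

```r
# Write the R version, platform and installed package versions to a file.
# (record_environment is a hypothetical helper name.)
record_environment <- function(file) {
  pkgs <- utils::installed.packages()[, c("Package", "Version"), drop = FALSE]
  lines <- c(
    paste("R:", R.version.string),
    paste("Platform:", R.version$platform),
    paste("Recorded:", format(Sys.Date())),
    sprintf("%s %s", pkgs[, "Package"], pkgs[, "Version"])
  )
  writeLines(lines, file)
  invisible(lines)
}

# A temporary directory is used here; in practice the file would live
# next to the analysis code in the project's repository.
manifest_file <- file.path(tempdir(), "r-environment.txt")
manifest <- record_environment(manifest_file)
print(head(manifest))
```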
If you compile and install source versions of R and CRAN packages, then I would keep source level tarballs of each in said version control system so that you can reasonably ensure access to them when you need it, even though they may also be available via CRAN.
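As a small sketch of how you might later confirm that archived tarballs (or datasets) are unchanged, the following records an MD5 checksum for each file in a directory using tools::md5sum. The helper name checksum_manifest, the directory and the file names are placeholders invented for the example; the files written here are dummies standing in for real tarballs.

```r
# Return a data frame of file names and MD5 checksums for one directory.
# (checksum_manifest is a hypothetical helper name.)
checksum_manifest <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  files <- files[!file.info(files)$isdir]  # skip subdirectories
  data.frame(file = basename(files),
             md5 = unname(tools::md5sum(files)),
             stringsAsFactors = FALSE)
}

# Demo with dummy files; real use would point at the tarball archive.
demo_dir <- file.path(tempdir(), "archive-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("placeholder for R sources",
           file.path(demo_dir, "R-x.y.z.tar.gz"))
writeLines("placeholder for a package",
           file.path(demo_dir, "examplepkg_1.0-2.tar.gz"))

manifest <- checksum_manifest(demo_dir)
print(manifest)

# Store the manifest outside the archived directory so it is not
# picked up on the next run.
utils::write.csv(manifest, file.path(tempdir(), "MD5SUMS.csv"),
                 row.names = FALSE)
```

Comparing a freshly computed manifest against the stored one will show whether any archived file has changed.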
I would be sure that such a repo (or more likely, content/project specific repos) is stored on a central server, which is backed up offline with sufficient frequency and redundancy to mitigate the risk of loss.
The two most popular VC tools these days are SVN and Git. There are significant differences in the implementation models of the two, so you will need to take time to consider your own functional and operational requirements, which may lead you in one direction or the other. That being said, I made the switch from SVN to Git last year, even though I don't need true distributed version control myself. There are various reasons for that switch, which are beyond the scope of this discussion, so I won't get into details here.
I hope that the above is helpful.