[Rd] [RFC] A case for freezing CRAN

Karl Millar kmillar at google.com
Thu Mar 20 05:30:32 CET 2014


I think what you really want here is the ability to easily identify
and sync to CRAN snapshots.

The easy way to do this is to set up a CRAN mirror, but back it up with
version control, so that it's easy to reproduce the exact state of
CRAN at any given point in time.  CRAN's not particularly large and
doesn't churn a whole lot, so most version control systems should be
able to handle that without difficulty.

Using svn, mod_dav_svn and (maybe) mod_rewrite, you could set up the
server so that e.g.:
   http://my.cran.mirror/repos/2013-01-01/
is a mirror of how CRAN looked at midnight 2013-01-01.

Users can then set their repository to that URL, and will have a
stable snapshot to work with, and can have all their packages built
with that snapshot if they like.  For reproducibility purposes, all
users need to do is to agree on the same date to use.  For publication
purposes, the date of the snapshot should be sufficient.
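
As a rough sketch of what that could look like on the user side, assuming
the date-based URL layout from the example above (my.cran.mirror is a
placeholder host, and snapshot_repo() and use_snapshot() are hypothetical
helper names, not existing functions):

```r
# Hypothetical mirror root, taken from the example URL above; not a real server.
snapshot_base <- "http://my.cran.mirror/repos"

# Build the repository URL for a given snapshot date.
snapshot_repo <- function(date, base = snapshot_base) {
  date <- as.Date(date)
  if (is.na(date)) stop("'date' must be a valid date")
  paste0(base, "/", format(date, "%Y-%m-%d"), "/")
}

# Point the current session at the snapshot and return the previous
# 'repos' setting invisibly so that it can be restored later.
use_snapshot <- function(date, base = snapshot_base) {
  old <- getOption("repos")
  options(repos = c(CRAN = snapshot_repo(date, base)))
  invisible(old)
}
```

Collaborators would then only need to share the date, e.g.
use_snapshot("2013-01-01"), before calling install.packages().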

We'd need a version of update.packages() that force-syncs all the
packages to the version in the repository, even if they're downgrades,
but otherwise it ought to be fairly straightforward.
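
A minimal sketch of such a force-sync, under the assumption that "sync"
means reinstalling every package whose installed version differs from the
one the repository offers (sync_plan() and force_sync() are hypothetical
names; only installed.packages(), available.packages(), compareVersion()
and install.packages() are existing functions):

```r
# Turn a package matrix into a named vector of version strings.
pkg_versions <- function(m) stats::setNames(m[, "Version"], rownames(m))

# Decide which packages must be reinstalled so the library matches the repo.
# 'installed' and 'available' are named character vectors: names are package
# names, values are version strings.
sync_plan <- function(installed, available) {
  common <- intersect(names(installed), names(available))
  differs <- vapply(common, function(p) {
    # Any difference counts, so downgrades are included as well as upgrades.
    compareVersion(installed[[p]], available[[p]]) != 0
  }, logical(1))
  common[differs]
}

# Force every installed package to the version the repository offers.
# Packages that are absent from the repository are left untouched.
force_sync <- function(repos = getOption("repos"), lib = .libPaths()[1]) {
  inst  <- pkg_versions(installed.packages(lib.loc = lib))
  avail <- pkg_versions(available.packages(repos = repos))
  todo  <- sync_plan(inst, avail)
  if (length(todo) > 0) install.packages(todo, repos = repos, lib = lib)
  invisible(todo)
}
```

Unlike update.packages(), this compares for inequality rather than
"newer than", which is what allows it to downgrade.  It does not remove
packages that the snapshot lacks; that would be a separate policy choice.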

FWIW, we do something similar internally at Google.  All the packages
that a user has installed come from the same source control revision,
where we know that all the package versions are mutually compatible.
It saves a lot of headaches, and users can roll back to any previous
point in time easily if they run into problems.


On Wed, Mar 19, 2014 at 7:45 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
> <michael.weylandt at gmail.com> wrote:
>> Reading this thread again, is it a fair summary of your position to say "reproducibility by default is more important than giving users access to the newest bug fixes and features by default?" It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs reproductions is rather high and the current setup optimizes for that already.
>
> I think that separating development from released branches can give us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released' version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of r-devel.
> For regular users and reproducible research it is recommended to use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the latest
> r-devel.
>
> I think that extending the R release cycle to CRAN would result both
> in more stable released versions of R and in more freedom for
> package authors to implement rigorous changes in the unstable branch.
> When writing a script that is part of a production pipeline, or a Sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use a stable version of R, which is guaranteed to behave
> the same over time. However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which has
> the latest versions of other CRAN and base packages.
>
>
>> What I'm trying to figure out is why the standard "install the following list of package versions" isn't good enough in your eyes?
>
> Almost nobody does this because it is cumbersome and impractical. We
> can do so much better than this. Note that in order to install old
> packages you also need to investigate which versions of dependencies
> of those packages were used. On win/osx, users need to manually build
> those packages which can be a pain. All in all it makes reproducible
> research difficult, expensive and error-prone. At the end of the
> day most published results obtained with R just won't be reproducible.
>
> Also I believe that keeping it simple is essential for solutions to be
> practical. If every script has to be run inside an environment with
> custom libraries, it takes away much of its power. Running a bash or
> python script in Linux is so easy and reliable that entire
> distributions are based on it. I don't understand why we make our
> lives so difficult in R.
>
> In my estimation, a system where stable versions of R pull packages
> from a stable branch of CRAN will naturally resolve the majority of
> the reproducibility and reliability problems with R. And in contrast
> to what some people here are suggesting it does not introduce any
> limitations. If you want to get the latest stuff, you either grab a
> copy of r-devel, or just enable the testing branch and off you go.
> Debian 'testing' works in a similar way, see
> http://www.debian.org/devel/testing.