[Rd] [RFC] A case for freezing CRAN

Philippe GROSJEAN Philippe.GROSJEAN at umons.ac.be
Fri Mar 21 13:51:38 CET 2014


On 21 Mar 2014, at 11:08, Rainer M Krug <Rainer at krugs.de> wrote:

> Jari Oksanen <jari.oksanen at oulu.fi> writes:
> 
>> On 21/03/2014, at 10:40 AM, Rainer M Krug wrote:
>> 
>>> 
>>> 
>>> This is a long and (mainly) interesting discussion, which is fanning
>>> out in many directions, and I think many of them are not that
>>> relevant to the OP's suggestion.
>>> 
>>> I see the advantages of having such a dynamic CRAN, but also of
>>> having a more stable CRAN. I prefer CRAN as it is now, but in many
>>> cases a more stable CRAN might be an advantage. So having releases of
>>> CRAN might make sense. But then there is the archiving issue of CRAN.
>>> 
>>> The suggestion was made to move the responsibility away from CRAN and
>>> the R infrastructure to the user / researcher to guarantee that the
>>> results can be re-run years later. It would be nice to have this
>>> built into CRAN, but let's stick with the scenario that the user
>>> should take care of reproducibility.
>> 
>> There are two different problems that alternate in this discussion:
>> reproducibility and breakage of CRAN dependencies. A frozen CRAN could
>> make *approximate* reproducibility easier to achieve, but real
>> reproducibility needs stricter solutions. The output of sessionInfo()
>> is the minimal information to record, but rebuilding an exact image of
>> an old environment may still be demanding (though in many cases this
>> does not matter).
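
Capturing at least that minimal information is cheap, by the way. A
small sketch of the idea (the file names are just examples):

    # record the R version and package versions used by an analysis
    si <- sessionInfo()
    saveRDS(si, "analysis-sessionInfo.rds")        # machine-readable copy
    writeLines(capture.output(print(si)),
               "analysis-sessionInfo.txt")         # human-readable copy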
>> 
>> The other problem is that CRAN is so volatile that new versions of
>> packages break other packages or old scripts. Here the main problem is
>> how package developers work, and freezing CRAN would not change that:
>> if package maintainers release breaking code, that is what would be
>> frozen. I think most packages make no distinction between development
>> and release branches, and CRAN policy won't change that.
>> 
>> I can sympathize with package maintainers who have 150 reverse
>> dependencies. My main package has only ~50, and it is certain that I
>> won't test them all with a new release. I have sometimes tried, but I
>> could not even get all of them built, because they had further
>> dependencies on packages that failed. Even the ones I could test
>> failed to reveal problems (in one case all the examples were \dontrun,
>> so the checks passed nicely without exercising any code). I only wish
>> that people who *really* depend on my package would test it against
>> the R-Forge version and alert me before the CRAN release, but that is
>> not very likely. (I guess many declared dependencies are not *really*
>> necessary, but only concern marginal features of a package; CRAN
>> forces maintainers to declare them all the same.)
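
For what it is worth, recent versions of R ship with tooling to
smoke-test reverse dependencies; a minimal sketch, assuming the package
sources sit in a directory "pkgs" and a CRAN mirror is set in
options("repos"):

    # check the source packages in "pkgs" together with the CRAN
    # packages that depend on them; results land in *.Rcheck directories
    tools::check_packages_in_dir("pkgs",
        reverse = list(repos = getOption("repos")["CRAN"]))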
> 
We work on these problems too. So far, for the latest CRAN versions, we
have successfully installed 4,999 of the 5,321 CRAN packages on our
platform. Regarding conflicts in terms of function names, around 2,000
packages are clean, but the rest produce more than 11,000 pairs of
conflicts, i.e., the same function name exported by two different
packages (a small illustration follows below).

For dependency errors, see the references cited earlier in this thread.
Strangely, a large portion of the R CMD check errors on CRAN appear and
disappear *without any version update* of the package or of any of its
direct or indirect dependencies. That is, a fraction of the errors or
warnings come and go without any code change. We have traced some of
these back to interaction with the network (e.g., an example or a
vignette downloads data from a server that is sometimes unavailable).
So yes, a complex and difficult topic.
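
As an illustration of the name clashes: they are easy to list for any
pair of installed packages. A small sketch (the two package names are
just a well-known example, and both must be installed):

    # exported names shared by two installed packages
    conflicts_between <- function(a, b)
        intersect(getNamespaceExports(a), getNamespaceExports(b))
    conflicts_between("dplyr", "stats")   # e.g. "filter" and "lag"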


> Breakage of CRAN packages is a problem on which I cannot comment much.
> I have no idea how this could be solved without introducing more
> checks, which nobody wants. CRAN is a (more or less) open repository
> for packages written by engineers and programmers, but also by
> scientists from other fields - and that is the strength of CRAN: a
> central repository of packages which conform to a minimal standard and
> format.
> 
>> 
>> Still, a few words about the reproducibility of scripts: this can
>> hardly be achieved with good coverage, because many scripts are very
>> ad hoc. When I edit and review manuscripts for journals, I very often
>> get Sweave or knitr scripts that "just work", where "just" means "just
>> so and so". Often they do not work at all, because they rely on
>> undeclared private functions or stray files in the author's workspace
>> that did not travel with the Sweave document.
> 
> One reason why I *always* start my R sessions with --vanilla and have
> a local initialization script which I call manually.
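
For readers unfamiliar with the flag, that amounts to something like
this (the script name is arbitrary):

    $ R --vanilla          # skip .Rprofile, .Renviron, saved workspaces
    > source("init.R")     # then load the project setup explicitly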
> 
>> I think these
>> -- published scientific papers -- are the main field where the code
>> really should be reproducible, but they are often the hardest to
>> reproduce.
> 
> And this is completely out of the hands of R / CRAN / ... and in the
> hands of journals and authors. But R could provide a framework to make
> this easier, in the form of a package which provides functions to make
> this a one-command approach.
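
Sketching what such a one-command function could look like (this is
purely hypothetical, not an existing API):

    # hypothetical: snapshot a script together with its session metadata
    archive_analysis <- function(script, dir = "archive") {
        dir.create(dir, showWarnings = FALSE)
        file.copy(script, dir)                      # the analysis itself
        saveRDS(sessionInfo(), file.path(dir, "sessionInfo.rds"))
        # a real implementation would also bundle package sources, data
    }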
> 
>> Nothing the CRAN people do can help with the sloppy code that
>> scientists write for publications. You know, they are scientists --
>> not engineers.
> 
This would be a first step. Then people would have to learn how to use,
say, Sweave, in order to ensure reproducibility. This is beginning to be
enforced by journal editors and publishers (JSS, or Elsevier, come to
mind).
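
The entry cost is low, for what it is worth. A minimal Sweave document
(call it doc.Rnw) that records its own environment looks like:

    \documentclass{article}
    \begin{document}
    <<session>>=
    sessionInfo()
    @
    \end{document}

and is compiled with R CMD Sweave doc.Rnw, followed by pdflatex on the
resulting .tex file.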

Best,

Philippe


> Absolutely - and I am also a sloppy scientist: I put my code online,
> but hope that not many people ask me about it later.
> 
> Cheers,
> 
> Rainer
> 
>> 
>> Cheers, Jari Oksanen
>>> 
>>> Leaving the issue of compilation aside, a package which creates a
>>> custom archive of the R installation, including the source of the R
>>> version used and the sources of the packages in a format compilable
>>> on Linux (given that the relevant dependencies are installed), would
>>> be a huge step forward.
>>> 
>>> I know - compilation on Windows (and sometimes on Mac) is a serious
>>> problem - but archiving *all* binaries and re-compiling all older
>>> versions of R and all packages would be an impossible task.
>>> 
>>> Apart from that, doing your analysis in a virtual machine and then
>>> simply archiving this virtual machine would also be an option, but
>>> only for the more tech-savvy users.
>>> 
>>> In a nutshell: I think a package would be able to provide the
>>> solution for local archiving, making it possible to re-run the
>>> simulation with the same tools at a later stage - although guarantees
>>> would not be possible.
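
The source-archiving half of that, at least, is already possible with
base tools; a sketch (the package list is a made-up example, and a CRAN
mirror must be set in options("repos")):

    # snapshot source tarballs of the packages an analysis uses, for
    # later re-compilation against a matching version of R
    dir.create("pkg-archive", showWarnings = FALSE)
    download.packages(c("vegan", "knitr"),
                      destdir = "pkg-archive", type = "source")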
>>> 
>>> Cheers,
>>> 
>>> Rainer
>>> -- 
>>> Rainer M. Krug
>>> email: Rainer<at>krugs<dot>de
>>> PGP: 0x0F52F982
>>> 
>> 
> 
> -- 
> Rainer M. Krug
> email: Rainer<at>krugs<dot>de
> PGP: 0x0F52F982


