[Rd] A few suggestions and perspectives from a PhD student

Joris Meys jorismeys at gmail.com
Tue May 9 11:22:48 CEST 2017

On Tue, May 9, 2017 at 9:47 AM, Hilmar Berger <berger at mpiib-berlin.mpg.de>
wrote:

> Hi,
> On 08/05/17 16:37, Ista Zahn wrote:
>> One of the key strengths of R is that packages are not akin to "fan
>> created mods". They are a central and necessary part of the R system.
> I would tend to disagree here. R packages are in their majority not
> maintained by the core R developers. Concepts, features and lifetime depend
> mainly on the maintainers of the package (even though in theory the GPL
> allows somebody to take over at any time). Several packages that are critical
> for processing big data and providing "modern" visualizations introduce
> concepts quite different from the legacy S/R language. I do feel that in a
> way, current core R shows strongly its origin in S, while modern concepts
> (e.g. data.table, dplyr, ggplot, ...) are often only available via
> extension packages. This is fine if one considers R to be a statistical
> toolkit; as a programming language, however, it introduces inconsistencies
> and uncertainties which could be avoided if some of the "modern" parts
> (including language concepts) could be more integrated in core-R.
> Best regards,
> Hilmar

And I would tend to disagree here. R is built upon the paradigm of a
functional programming language, and falls in the same group as Clojure,
Haskell and the like. It is a Turing-complete programming language in its
own right. That's quite a bit more than "a statistical toolkit". You can say
that about e.g. the macro language of SPSS, but not about R.
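
To make the "functional" part concrete, here is a minimal sketch in base R
only. The names compose(), add_one and squares are made up for this
illustration; Map() and Reduce() are standard base R functions.

```r
# Functions are ordinary values: they can be passed around and returned.
# compose() is a hypothetical helper written for this example.
compose <- function(f, g) function(x) g(f(x))

add_one <- function(x) x + 1
sqrt_then_add <- compose(sqrt, add_one)   # a closure over f and g
composed_value <- sqrt_then_add(16)       # sqrt(16) + 1

# Higher-order functions from base R: Map() applies, Reduce() folds.
squares <- Map(function(v) v^2, 1:3)      # list(1, 4, 9)
total <- Reduce(`+`, squares)             # 1 + 4 + 9
```

Nothing here depends on an extension package; closures and higher-order
functions are part of the language itself.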

Second, there's little "modern" about the ideas behind the tidyverse.
Piping is about as old as Unix itself. The grammar of graphics, on which
ggplot2 is based, stems from the SYSTAT graphics system of the nineties.
Hadley and colleagues did (and do) a great job implementing these ideas in
R, but the ideas do have a respectable age.
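
As a rough illustration of how little machinery piping needs, the sketch
below defines a pipe-like operator in base R. The operator name %then% is
invented for this example and is not the operator any package exports.

```r
# A hypothetical pipe: pass the left-hand value to the right-hand function.
`%then%` <- function(lhs, rhs) rhs(lhs)

# Custom %op% operators are left-associative, so this reads left to right
# and is equivalent to sum(sqrt(c(4, 9, 16))).
piped <- c(4, 9, 16) %then% sqrt %then% sum
nested <- sum(sqrt(c(4, 9, 16)))
```

Both forms compute the same thing; the pipe only changes the order in which
you read the steps.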

Third, there's a lot of non-standard evaluation going on in all these
packages. Using them inside your own functions requires serious attention
(e.g. the difference between aes() and aes_() in ggplot2). Actually, even
though I definitely see the merits of these packages in data analysis, the
tidyverse feels like a (clean and powerful) macro language on top of R. And
that's good, but that doesn't mean these parts are essential to transform R
into a programming language. Rather the contrary, actually: relying too
heavily on these packages complicates things when you start to develop
your own packages in R.
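
The same pitfall can be shown without any tidyverse package, using base R's
subset(), which also relies on non-standard evaluation. This is only a
sketch: the data and the wrappers filter_naive() and filter_nse() are made
up, and it assumes there is no variable called dose in the calling
environment.

```r
trial <- data.frame(dose = 1:5, resp = c(2, 4, 6, 8, 10))

# Interactive use works: `dose > 2` is captured and evaluated inside trial.
direct <- subset(trial, dose > 2)

# Naive wrapper: subset() captures the symbol `cond`, and forcing it
# evaluates `dose > 2` in the caller's environment, where dose is unknown.
filter_naive <- function(data, cond) subset(data, cond)
naive <- tryCatch(filter_naive(trial, dose > 2),
                  error = function(e) conditionMessage(e))

# Wrapper that handles the expression itself: capture it with substitute()
# and evaluate it with the data frame as the first place to look.
filter_nse <- function(data, cond) {
  cond_expr <- substitute(cond)
  keep <- eval(cond_expr, data, parent.frame())
  data[keep & !is.na(keep), , drop = FALSE]
}
fixed <- filter_nse(trial, dose > 2)
```

Here naive ends up holding an error message rather than a data frame, while
fixed contains the three expected rows.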

Fourth, the tidyverse masks quite a few native R functions. Obviously they
took great care to keep the functionality as close as possible to what one
would expect, but that doesn't always work out. The lag() function of dplyr
masks an S3 generic from the stats package, for example. So if you work with
time series in the stats package, loading the tidyverse gives you trouble.
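
A sketch of what such masking looks like, in base R only. The lag() defined
below is a hand-written stand-in that merely imitates shifting the values of
a vector; it is not dplyr's code, and the series is invented.

```r
series <- ts(c(10, 20, 30, 40), start = 2001)

# stats::lag() shifts the time base and leaves the values untouched.
shifted <- stats::lag(series, k = 1)
shifted_start <- start(shifted)[1]

# Stand-in for a masking function: shift values, pad with NA, drop ts info.
lag <- function(x, n = 1L) c(rep(NA, n), head(as.vector(x), -n))

masked_result <- lag(series, 1)             # now calls the stand-in
explicit_result <- stats::lag(series, 1)    # namespace prefix still works

rm(lag)  # remove the stand-in so lag() resolves to stats::lag() again
```

Calling stats::lag() explicitly is the usual way to stay safe once a
function of the same name has been attached.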

Fifth, many of the tidyverse packages are at version 0.x.y: they're still
in beta development and their functionality might (and will) change.
Functions disappear, arguments are renamed, tags change, ... Often
the changes improve the packages, but they did break older code for me more
than once. You can't expect the R core team to incorporate something that
is bound to change.

Last but not least, the tidyverse sometimes actually works against new R
users, at least those who go beyond the classic data workflow. I
literally rewrote some code (from a consultant) that abused the _ply
functions to create nested loops. Removing all that stuff and rewriting the
code using a simple list in combination with a simple for-loop sped up the
code by a factor of 150. That has nothing to do with dplyr, which is very
fast. It has everything to do with that person having a hammer and thinking
everything he sees is a nail. The tidyverse is no reason not to learn the
concepts of the language it's built upon.
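
I can't share the consultant's code, so the following is only a sketch of
the pattern I mean by "a simple list with a simple for-loop", using the
built-in mtcars data. It is not a benchmark and says nothing about the
speed-up itself.

```r
# Split once, then fill a preallocated list in a plain for-loop.
groups <- split(mtcars$mpg, mtcars$cyl)

results <- vector("list", length(groups))
names(results) <- names(groups)

for (i in seq_along(groups)) {
  results[[i]] <- mean(groups[[i]])
}

group_means <- unlist(results)   # named numeric vector, one mean per group
```

Preallocating the list avoids growing an object inside the loop, which is a
common source of slow loops in R.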

The one thing I would like to see though, is the adaptation of the
statistical toolkit so that it can work with data.table and tibble objects
directly, as opposed to having to convert to a data.frame once you start
building the models. And I believe that eventually there will be a
replacement for the data.frame that increases R's performance and reduces
its memory footprint.

So all in all, I do admire the tidyverse and how it speeds up data
preparation for analysis. But the tidyverse is a powerful data toolkit, not
a programming language. And it won't make R a programming language either,
because R already is one.


> --
> Dr. Hilmar Berger, MD
> Max Planck Institute for Infection Biology
> Charitéplatz 1
> D-10117 Berlin
> Phone:  + 49 30 28460 430
> Fax:    + 49 30 28460 401
>  E-Mail: berger at mpiib-berlin.mpg.de
> Web   : www.mpiib-berlin.mpg.de

Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
Joris.Meys at Ugent.be
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
