[Rd] A few suggestions and perspectives from a PhD student
antonink at idi.ntnu.no
Fri May 5 19:00:09 CEST 2017
Dear Sir or Madam,
I am in the second year of my PhD in bioinformatics, after a Master's in computer science, and have been using R heavily throughout. I have put together a list of features that, in my opinion, would be beneficial to add to R, or could be improved. The first two are already implemented in packages, but because they are implemented as user-defined operators, their usefulness is greatly restricted. I hope you will find my suggestions interesting. If you find the time, I will welcome any feedback on whether you find them useful, or on why you think they should not be implemented. I would also welcome being pointed to any features I might be unaware of that solve the issues below.
1) Piping
Currently available in the package magrittr, piping makes code more readable by starting a line at its natural starting point and following it with the functions that are applied, in order. The readability of several nested calls, each with a number of parameters, is almost zero; it is almost easier to work the solution out oneself than to read it. A pipeline, in comparison, is very straightforward, especially together with point (2).
The package works rather well here; the shortcomings of piping not being native are not quite as severe as in point (2). Nevertheless, an intuitive symbol such as | would be helpful, and it sometimes bothers me that I have to parenthesize anonymous functions, which would probably not be required with a native pipe operator, much as it is not required in e.g. lapply. That is,
1:5 %>% function(x) x+2
should be totally fine
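For illustration, a minimal user-defined pipe shows the behaviour in question (a sketch only; magrittr's real implementation uses non-standard evaluation and is far more capable):

```r
# A naive pipe: simply apply the right-hand side to the left-hand side.
# (Sketch only; not how magrittr actually works.)
`%>%` <- function(lhs, rhs) rhs(lhs)

# With the anonymous function parenthesized, this evaluates as expected:
result <- 1:5 %>% (function(x) x + 2)
print(result)  # 3 4 5 6 7
```

magrittr itself rejects an unparenthesized anonymous function on the right-hand side, which is the restriction referred to above.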
2) Currying
Currently available in the package Curry. The idea is that, given a function such as foo = function(x, y) x+y, one would like to write, for example, lapply(1:5, foo(3)), and have the interpreter figure out that foo(3) does not produce a value result, but can still produce a function result, namely a function of y. This would be most useful with the various apply functions, as opposed to writing function(x) foo(3,x).
I suggest that currying would make code easier to write, and more readable, especially when using apply functions. One might imagine some confusion with such a feature, especially among people unfamiliar with functional programming, although R already treats functions as first-class values, so it could be just fine. Alternatively, one could address it with special syntax, such as $foo(3) [$foo(x=3)] for partial application. The current currying package has very limited usefulness: being constrained by the user-defined-operator framework, it only rarely yields less code or more readability. Compare for yourself:
$foo(x=3)  vs  foo %<% 3
goo = function(a, b, c)
$goo(b=3)  vs  goo %><% list(b=3)
Moreover, one would often like currying to have the highest priority. For example, when piping:
data %>% foo %>% foo1 %<% 3
if one wants it to mean data %>% foo %>% $foo1(x=3), the curry must bind to foo1 before the pipe is applied.
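As an illustration of what can be done in current base R, a small partial-application helper takes only a few lines (partial, foo, and add3 are hypothetical names, not an existing API; the proposed $foo(x=3) syntax does not exist in R):

```r
# A minimal partial-application helper (sketch).
partial <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(fixed, list(...)))
}

foo <- function(x, y) x + y

# Instead of lapply(1:5, function(y) foo(3, y)):
add3 <- partial(foo, x = 3)
result <- unlist(lapply(1:5, add3))
print(result)  # 4 5 6 7 8
```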
3) Code executable only when running the script itself
Whereas the first two suggestions borrow from Haskell and the like, this one borrows from Python. I am building a fairly complicated pipeline using S4 classes. After defining a class and its methods, I also define how to build the class to my liking, based on my input data, using the methods just defined. So I end up with a list of command-line arguments to process, and the recipe for creating the class instance from them. If I put this in the class file, however, the code runs whenever the file is sourced from the next step in the pipeline, which needs the preceding class definitions.
A feature such as the Pythonic `if __name__ == "__main__"` would thus be useful. As it is, I had to create run scripts as separate files, which is actually not so terrible, given that a class and its methods often span a few hundred lines, but still.
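One idiom that approximates the Python guard in current R, assuming the script is executed directly with Rscript (a sketch; main is a hypothetical name):

```r
# sys.nframe() is 0 when this code runs at the top level of a script
# executed directly (e.g. via Rscript), but greater than 0 when the file
# is source()d from another script, because source() adds call frames.
main <- function() {
  # process command-line arguments, build the class instance, etc.
  cat("running as a script\n")
}

if (sys.nframe() == 0L) {
  main()
}
```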
4) non-exported global variables
I also find it lacking that I seem unable to create constants that do not get passed on to files that source the class definition. That is, if class1 features a global constant CONSTANT=3, then if class2 sources class1, it also picks up the constant. This 1) clutters the namespace when running the code interactively, and 2) potentially overwrites constants in case of a name clash. Some kind of export/non-export variable syntax, or symbolic imports, or namespaces would be useful. I know that if I converted the code to a package I would get at least something like a namespace, but still.
I understand that the variable cannot simply not be imported, in general, as the functions will typically rely on it (otherwise it would not need to be there). But one could consider hiding it in an implicit per-file namespace, for example.
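One partial workaround in current R is to keep such constants inside a closure environment via local(), so that only the functions escape into the workspace of whoever sources the file (a sketch; scale_by_constant is a hypothetical name):

```r
# The constant lives in the environment created by local(); only the
# returned function is bound in the sourcing file's workspace.
scale_by_constant <- local({
  CONSTANT <- 3  # private: not visible in the global environment
  function(x) x * CONSTANT
})

print(scale_by_constant(2))   # 6
print(exists("CONSTANT"))     # FALSE when run at top level
```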
5) S4 methods with same name, for different classes
Say I have an S4 class called datasetSingle, and another S4 class called datasetMulti, which gathers a number of datasetSingle objects and adds some extra functionality on top. The datasetSingle class may have a method replicates that returns a named vector assigning a replicate number to each experiment name in the dataset. But I would also like a function with the same name for the datasetMulti class, returning a data frame, or a list, covering the replicate numbers for all the datasets included.
But then I need to setGeneric for the method, and if I call setGeneric before both implementations, the second call resets the generic, losing the "replicates" definition for datasetSingle. Skipping the call in the datasetMulti code means that 1) I have to remember that I had defined the function for datasetSingle, and 2) if I remove the function, or change its name, in datasetSingle, I now have to change the datasetMulti class file too. Moreover, if I wanted a different generic for the datasetMulti version, I would have to change it not in the datasetMulti class file, but in the datasetSingle file, where it might not make much sense. In this case, I wanted an extra argument "datasets", to return the replicates only for the datasets specified, rather than for all of them.
I made a wrapper that could circumvent the first issue, but the second issue is not easy to circumvent.
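One guard that addresses the resetting problem in current R is to create the generic only if it does not yet exist. A sketch using the class names from above (the slot names are hypothetical):

```r
library(methods)

setClass("datasetSingle", slots = c(reps = "numeric"))
setClass("datasetMulti", slots = c(sets = "list"))

# Guard the generic so that source()ing a second file containing the
# same guard does not wipe out methods registered by the first:
if (!isGeneric("replicates")) {
  setGeneric("replicates", function(object, ...) standardGeneric("replicates"))
}
setMethod("replicates", "datasetSingle", function(object, ...) object@reps)

# In the datasetMulti file, the same guarded setGeneric is a no-op,
# and the datasetSingle method above survives:
if (!isGeneric("replicates")) {
  setGeneric("replicates", function(object, ...) standardGeneric("replicates"))
}
setMethod("replicates", "datasetMulti",
          function(object, ...) lapply(object@sets, replicates))

s <- new("datasetSingle", reps = c(exp1 = 1, exp2 = 2))
m <- new("datasetMulti", sets = list(s))
print(replicates(s))  # exp1 exp2 -> 1 2
```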
6) Many parameters freeze S4 method calls
If I specify roughly six or more parameters in an S4 method signature, I often get a "freeze" on the first method call. The process eats up a lot of memory before entering the call, after which the call executes normally (if it has not run out of memory, or I out of patience). Subsequent calls of the method do not incur this overhead. The memory involved can run to gigabytes, and the time to minutes. I suspect this may be due to an entry being generated in the dispatch table for each accepted signature. It can be circumvented, but it is certainly not behaviour one would expect.
7) Default values for S4 methods
It would seem that it is not possible to set default parameters for an S4 method in the usual way of definition = function(x, y = 5). I resorted to making class unions with "missing" in the call signatures, with the method body starting with if (missing(param)) param <- DEFAULT_VALUE, but that certainly does not improve readability or ease of coding.
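For comparison, two sketches with hypothetical generics: the "missing"-dispatch workaround described above, and a plain formal default, which in my understanding is honoured as long as the defaulted argument is not itself part of the dispatch signature:

```r
library(methods)

# Workaround as described: dispatch on class "missing" to supply a default.
setGeneric("bar", function(x, y) standardGeneric("bar"))
setMethod("bar", signature("numeric", "numeric"), function(x, y) x + y)
setMethod("bar", signature("numeric", "missing"), function(x, y) bar(x, 5))

# A plain formal default, honoured when `y` is not used for dispatch:
setGeneric("baz", function(x, y = 5) standardGeneric("baz"))
setMethod("baz", signature(x = "numeric"), function(x, y = 5) x + y)

print(bar(1))     # 6
print(baz(1))     # 6
print(baz(1, 2))  # 3
```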
Thank you for your time if you have read this far. :) Looking forward to any answer.