[R] Variable passed to function not used in function in select=... in subset

Tue Nov 11 16:01:28 CET 2008

Duncan Murdoch wrote:
> On 11/11/2008 8:53 AM, hadley wickham wrote:
>> On Mon, Nov 10, 2008 at 1:04 PM, Wacek Kusnierczyk
>> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>>> pardon me, but does this address in any way the legitimate complaint of
>>> the rightfully confused user?
>>>
>>> consider the following:
>>>
>>> d = data.frame(a=1, b=2)
>>> a = c("a", "b")
>>> z = a
>>> # that is, both a and z are c("a", "b")
>>>
>>> subset(d, select=z)
>>> # gives two columns, since z is a two-element vector whose elements are
>>> # valid column names
>>>
>>> subset(d, select=a)
>>> # gives one column, since 'a' (but not a) is a valid column name
>>>
>>> subset(d, select=c(a,b))
>>> # gives two columns
>>>
>>>
>>> this is certainly what the authors intended, and they may have good
>>> grounds for this smart design.  but this must break the expectation of
>>> a naive (r-naive, for that matter) user, who may otherwise have
>>> excellent experience in using a functional programming language, e.g.,
>>> scheme.  (especially scheme, where symbols and expressions are
>>> first-class objects, yet the distinction between a symbol or an
>>> expression and their referent is made painfully clear, perhaps except
>>> for when one hacks with macros.)
>>>
>>> the examples above illustrate the notorious problem with r that one can
>>> never tell whether 'a' means "the value referred to with the identifier
>>> 'a'" or "the symbol 'a'", unless one gets ugly surprises and is forced
>>> to study the documentation.  and even then one may not get a clear
>>> answer.
>>
>> I agree, with some caveats.  There are basically two uses of R: as an
>> interactive data analysis package and as a statistical programming
>> language.  These uses come into conflict: in the interactive
>> environment, you want to minimise typing so that you can be as speedy
>> as possible.  It doesn't matter if R occasionally makes a wrong guess
>> when you have specified something implicitly, because you can fix it
>> on the fly.  When you are programming, you care less about saving
>> typing and more about reproducibility.  You want to be explicit so
>> your function is robust to widely varying inputs, even if it means you
>> have to type a lot more.  You see this tension in quite a few places:
>>
>>  * drop = T
>>  * functions that return different types of output (e.g. sapply)
>>    depending on input parameters
>>  * partial matching of argument names
>>  * using unevaluated expressions instead of strings (e.g. library,
>>    subset, ...)
>>
>> These are all things that are helpful for interactive use, but make
>> life as a programmer more difficult.  I find the last one particularly
>> frustrating because it means it is very difficult to program with some
>> functions (e.g. subset) without resorting to complex quoting,
>> substituting and evaluating tricks.  I have tried to steer away from
>> this technique in my packages and, where it's just too convenient for
>> interactive use, to insulate the deparsing in special functions that
>> the data analyst must use (e.g. aes() in ggplot, and .() in plyr),
>> along with providing alternatives for the programmer.
>>
>> I don't understand why you're getting so much push-back on this issue.
>>  R is a fantastic language, but it has some genuinely nasty corners.
>> In my opinion, this is one of them.
>
> I think your analysis is correct, that the goals of casual use and
> programming are inconsistent.  But in general I think there's always
> going to be support for providing alternative ways that are
> programmer-safe.

you know, in ipython you can write, e.g., m 1 instead of m(1) to call
the method m on the value 1.  but this is a syntactic shorthand which is
not valid in python, and you can see how it gets translated into python
when you try it.  so you have the cake and you eat it -- there is a
consistent (at least, much more consistent than in r) policy on the
syntax, and you can still have conveniences in the interactive interpreter.

r, on the other hand, prefers solutions such as the subset one, which
are the best recipe for confusion.  why would the r team not have a look
at what others are doing?  programming language design has progressed a
lot since the so often cited reference for r was written in 1988.
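
to make the confusion concrete, here is a minimal sketch that reuses the
d, a and z objects from the example quoted earlier in this thread and
contrasts subset's select with plain indexing.  pick_columns is a
hypothetical helper, named here only for illustration; it is not part
of base r, and it is just one possible way to get a programmer-safe
alternative.

```r
d <- data.frame(a = 1, b = 2)
a <- c("a", "b")
z <- a

# subset() evaluates `select` in an environment where each column name is
# bound to its column position, so a bare `a` means column 1 and shadows
# the character vector `a` defined above.
by_z <- subset(d, select = z)  # z is not a column name: c("a", "b") is used
by_a <- subset(d, select = a)  # a is a column name: only column "a" is kept

# plain indexing evaluates its argument as an ordinary value, so the
# character vector is always interpreted as a set of column names.
by_index <- d[, a, drop = FALSE]

# hypothetical helper (an assumption for illustration, not a base r function):
# it accepts only character column names and never looks at symbols.
pick_columns <- function(df, cols) {
  stopifnot(is.character(cols), all(cols %in% names(df)))
  df[, cols, drop = FALSE]
}
by_helper <- pick_columns(d, a)

print(names(by_z))
print(names(by_a))
print(names(by_index))
print(names(by_helper))
```

the point of the helper is that its argument is always evaluated the
usual way, so its behaviour does not depend on whether a variable in the
caller happens to share a name with a column.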

vQ