[R] Variable passed to function not used in function in select=... in subset

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Mon Nov 10 22:17:45 CET 2008

Gabor Grothendieck wrote:
> Certainly this has been recognized as a potential problem:
> http://developer.r-project.org/nonstandard-eval.pdf
> however, it is convenient when you are performing
> an analysis and entering commands directly as opposed
> to writing a program although possibly the potential ambiguities
> overshadow the convenience.

in most cases, i do not see why one could not use a string literal
passed by value instead of having an expression deparsed within the
function, which may lead to confusing behaviour.  this would give much
more consistent and predictable code.  this has nothing to do with the
evaluation mechanism, which can still be lazy. 
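
a minimal sketch of the difference, assuming two hypothetical helpers
(pick_quoted and pick_deparsed are made up for illustration, they are
not part of base r): one takes the column name as a string value, the
other deparses whatever expression it was given.

```r
# hypothetical helpers, for illustration only
pick_quoted   <- function(d, name) d[, name, drop = FALSE]
pick_deparsed <- function(d, name) d[, deparse(substitute(name)), drop = FALSE]

d   <- data.frame(a = 1:2, b = 3:4)
col <- "b"

# the string is passed by value, so a variable holding the name works
quoted <- names(pick_quoted(d, col))

# the deparsing version works only when the column name is typed literally
literal <- names(pick_deparsed(d, b))

# given a variable, it looks for a column literally called 'col' and fails
indirect <- tryCatch(names(pick_deparsed(d, col)),
                     error = function(e) "error")
```

the string-based version behaves the same whether the name is typed in
place or computed elsewhere; the deparsing version does not.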

in the case of subset, i do not really see how this design might be
helpful, but it's easy to see how it can be harmful; examples have just
been given.  the convenience amounts at most to being able to omit
quotes, at the risk of having columns selected where they should not be,
and vice versa.  the worst thing is that it destroys the benefit of
lexical scoping:

subset(d, select=group)

did the programmer intend to select the column named 'group'?  or the
columns whose names appear in the vector group?  is d supposed not to
have a column named 'group'?  should one change the identifier if d does
have such a column, to avoid selecting that column instead of whatever
else would be selected?  etc.  could this not be written as

subset(d, select="group") 

(two extra characters), and have it cleanly and always mean 'pick the
one column named 'group''? 
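
to make the ambiguity concrete, here is a small sketch (the data frames
d1 and d2 and the contents of the vector group are made up for
illustration).  the same call selects different things depending on
whether the data happens to have a column named 'group':

```r
group <- c("x", "y")   # a vector of column names in the calling environment

d1 <- data.frame(x = 1:2, y = 3:4, z = 5:6)       # no column named 'group'
d2 <- data.frame(x = 1:2, y = 3:4, group = 5:6)   # has a column named 'group'

# unquoted: the meaning depends on the data
sel1 <- names(subset(d1, select = group))   # uses the vector: "x" "y"
sel2 <- names(subset(d2, select = group))   # uses the column: "group"

# quoted: always the column named 'group', or an error if there is none
q2 <- names(subset(d2, select = "group"))
q1 <- tryCatch(names(subset(d1, select = "group")),
               error = function(e) "error")
```

with the quoted form the call either selects the column named 'group'
or fails loudly; it never silently falls back to a variable of the same
name.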

so there are actually three problems here:
- one that a programmer may be unaware that her own code does not do
what she intends;
- another that a user may be unaware that the code she uses performs
this way;
- another that a user may not be sure whether the code may be reused as
is, or must be modified so as not to interfere with the particular data.

the dependence of subset's behaviour on the particular data it is
applied to is confusing.  and here's an example of how it breaks its own
smart semantics:

d = data.frame(a=1)
d$`c(a,b)` = 2
# no problem, two columns
# one named 'c(a,b)'

subset(d, select=c(a,b))
# so what?  the expression given to select certainly is a valid and
actual name of a column in d, but subset complains there's no such
column (well, it actually says object "b" not found, by which it
probably means that object b, i.e., object named 'b', has not been
found.  not only uninformative as a message in this situation, but also
revealing the pervasive confusion of the name and the named, as the
object "b" -- a one-character string -- has not been mentioned here at
all.  what a mess.)
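
for reference, a sketch of the same case with the error caught and with
the quoted alternative, assuming no object named b exists in the calling
environment (the exact wording of the error message may differ between
r versions, so it is captured rather than spelled out):

```r
d <- data.frame(a = 1)
d$`c(a,b)` <- 2
cols <- names(d)   # two columns: 'a' and 'c(a,b)'

# unquoted: c(a, b) is evaluated as a call, and b cannot be found
msg <- tryCatch({ subset(d, select = c(a, b)); "no error" },
                error = function(e) conditionMessage(e))

# quoted: the string is taken as a column name, so the column is found
picked <- subset(d, select = "c(a,b)")
```

here picked is a one-column data frame holding the column named
'c(a,b)', while msg holds whatever subset complained about.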

this can't possibly be considered good design, can it?  the dubious
benefit is heavily outweighed by the drawbacks.

