[R] Variable passed to function not used in function in select=... in subset

Gabor Grothendieck ggrothendieck at gmail.com
Mon Nov 10 20:37:45 CET 2008


Certainly this has been recognized as a potential problem:

http://developer.r-project.org/nonstandard-eval.pdf

However, it is convenient when you are performing
an analysis and entering commands directly, as opposed
to writing a program, although the potential ambiguities
may well outweigh the convenience.
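
To make the ambiguity concrete, here is a minimal sketch in R. The helper
names pick_nse and pick_std and the data frames d and d2 are made up for
illustration and do not appear elsewhere in this thread. It contrasts a
wrapper around subset(), whose select argument is evaluated with the
column names in scope, with a wrapper that uses plain indexing, where the
argument is always treated as an ordinary variable:

```r
# Hypothetical helpers for illustration; neither name appears in the thread.

# Relies on subset()'s non-standard evaluation of 'select':
# the symbol g is looked up among the columns of d first.
pick_nse <- function(d, g) names(subset(d, select = g))

# Uses standard evaluation: g is always looked up as an ordinary variable.
pick_std <- function(d, g) names(d[, g, drop = FALSE])

d  <- data.frame(a = 1, b = 2)          # no column named "g"
d2 <- data.frame(g = 0, a = 1, b = 2)   # has a column named "g"
cols <- c("a", "b")

pick_nse(d,  cols)   # g is not a column, so its value c("a", "b") is used
pick_nse(d2, cols)   # g is a column, so only that column is selected
pick_std(d,  cols)   # the value of g is used
pick_std(d2, cols)   # the value of g is used, whatever the columns are called
```

If this reading is right, plain indexing is probably the safer choice
inside functions, while subset() remains handy at the prompt.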

On Mon, Nov 10, 2008 at 2:04 PM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> pardon me, but does this address in any way the legitimate complaint of
> the rightfully confused user?
>
> consider the following:
>
> d = data.frame(a=1, b=2)
> a = c("a", "b")
> z = a
> # that is, both a and z are c("a", "b")
>
> subset(d, select=z)
> # gives two columns, since z is a two-element vector whose elements are
> # valid column names
>
> subset(d, select=a)
> # gives one column, since 'a' (but not a) is a valid column name
>
> subset(d, select=c(a,b))
> # gives two columns
>
>
> this is certainly what the authors intended, and they may have good
> grounds for this smart design.  but this must break the expectation of a
> naive (r-naive, for that matter) user, who may otherwise have excellent
> experience in using a functional programming language, e.g., scheme.
> (especially scheme, where symbols and expressions are first-class
> objects, yet the distinction between a symbol or an expression and their
> referent is made painfully clear, perhaps except for when one hacks with
> macros.)
>
> the examples above illustrate the notorious problem with r that one can
> never tell whether 'a' means "the value referred to with the identifier
> 'a'" or "the symbol 'a'", unless one gets ugly surprises and is forced
> to study the documentation.  and even then one may not get a clear answer.
>
> the example given by the confused user is a red flag warning.  it's a
> typical abstraction where a nested sequence of operations (here print
> over names over subset) is abstracted into a single procedure, which can
> be called with whatever arguments are valid:
>
> pns = function(d, g) print(names(subset(d, select=g)))
>
> what sane person, without carefully studying the gory details of subset,
> will ever expect that if the first argument happens to have a column
> named 'g', only this one will be selected, while if it doesn't, subset
> will select the columns named by the components of what 'g' evaluates
> to?  i wonder how many users have *not* noticed that what they get is
> not what they assume they get because of such tricky tricks, and in
> consequence were not able to publish their analyses (or worse, have
> published them).
>
> what is scary is that this may happen with about any other function in
> r, because the design is pervasive.  no one should ever use any r
> function without first carefully reading the docs (which is not
> guaranteed to help) or trying it first on a number of carefully crafted
> test cases.  if such care is not taken, results obtained with r cannot
> be taken seriously.
>
>
> vQ
>
>
> Gabor Grothendieck wrote:
>> Forgot the name part.  Try:
>>
>> TestFunc2 <- function(DF, group) names(DF[group])
>> TestFunc3 <- function(...) names(subset(..., subset = TRUE))
>> TestFunc4 <- function(...) eval.parent(names(subset(..., subset = TRUE)))
>>
>> # e.g.
>> df1 <- data.frame(group = "G1", visit = "V1", value = 0.9)
>> TestFunc2(df1, c("group", "visit"))
>> TestFunc3(df1, c("group", "visit"))
>> TestFunc4(df1, c("group", "visit"))
>> TestFunc4(df1, c(group, visit)) # this works too
>>
>> On Mon, Nov 10, 2008 at 10:43 AM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>
>>> Here are a few things to try:
>>>
>>> TestFunc1 <- get("[")
>>>
>>> TestFunc2 <- function(DF, group) DF[group]
>>>
>>> TestFunc3 <- function(...) subset(..., subset = TRUE)
>>>
>>>
>>>
>>> On Mon, Nov 10, 2008 at 10:18 AM, Karl Knoblick <karlknoblich at yahoo.de> wrote:
>>>
>>>> Hello!
>>>>
>>>> My problem is that inside my function, subset() uses the data frame's own column name rather than the variable I passed in. It is difficult to explain, but here is an easy example:
>>>>
>>>> TestFunc<-function(df, group) {
>>>>     print(names(subset(df, select=group)))
>>>> }
>>>> df1<-data.frame(group="G1", visit="V1", value=0.9)
>>>> TestFunc(df1, c("group", "visit"))
>>>>
>>>> Result:
>>>> [1] "group"
>>>>
>>>> But I expected and want to have [1] "group" "visit" as result! Does anybody know how to get this result?
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>
>


