[R] Variable passed to function not used in function in select=... in subset

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Tue Nov 11 11:08:26 CET 2008


Gavin Simpson wrote:
>
>> d = data.frame(a = 1)
>> d$`-b` = 2
>> names(d)
>> # here we go
>>
>> subset(d, select = -b)
>> # to b or not to b?
>>     
>
> but -b is not the name of the column; you explicitly called it `-b` and
> you should refer to it as such. If you use "non-standard" names then
> expect to do a bit more work.
>   
identical(names(d)[2], "-b")

if i do

d$`c` = 4

then you claim d has no column named 'c'?  do i have to refer to the c
column as `c`?


>   
>> subset(d, select = `-b`)
>>     
>   -b
> 1  2
>   

... and i have to use

subset(d, select = `a`)

and not

subset(d, select = a)

right?  besides, subset(d, select = `-b`) should rather return the
column(s) whose names are the value of the variable `-b`:

`-a` = "a"
subset(d, select = `-a`)
# returns the column named 'a' (the value of the variable `-a`), rather than
# the column named '-a' -- but that's just because there is no such column in
# d;  if there were, that one would be returned.

so even with backquotes used, there is no obvious interpretation of what
select = `-b` should mean, because it depends on what names the components
of the first argument have.  and this breaks the concept of referential
transparency.
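
a minimal sketch of that dependence, assuming a current version of R; the data frames d1 and d2 and the variable named `-b` are made up for illustration only:

```r
# two data frames that differ only in whether a column literally named "-b" exists
d1 <- data.frame(a = 1, b = 3)
d2 <- data.frame(a = 1, b = 3)
d2$`-b` <- 2

# an ordinary variable whose *name* is "-b" and whose value is "a"
`-b` <- "a"

# no column named "-b" in d1: the symbol falls through to the variable above
r1 <- subset(d1, select = `-b`)

# d2 has a column named "-b": the column wins, the variable is ignored
r2 <- subset(d2, select = `-b`)

names(r1)
names(r2)
```

the same expression, select = `-b`, picks column 'a' from d1 and column '-b' from d2.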

so the problem is not so easily explained away.  what subset does *is*
messy.


>> subset(d, select = - `-b`)
>>     
>   a
> 1 1
>
>   
>> b = "a"
>> subset(d, select = -b)
>> # tragedy
>>     
>
> For this, I interpret it as not finding a column named b, so it tries to
> evaluate:
>
>   

you interpret it.  how obvious is this for most users?
it tries to find a column named 'b', not a column named b.  that's the
problem with subset.
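
to make the lookup order concrete, here is a simplified model of how select seems to be handled. select_cols is a hypothetical helper written for this illustration, not the actual subset code, and it ignores the row-filtering part entirely:

```r
# hypothetical helper: a simplified model of how subset() treats 'select'
select_cols <- function(x, select) {
  # map every column name to its position, e.g. a -> 1, `-b` -> 2
  nl <- as.list(seq_along(x))
  names(nl) <- names(x)
  # evaluate the unevaluated 'select' expression: column names are tried
  # first, everything else is looked up in the caller's environment
  vars <- eval(substitute(select), nl, parent.frame())
  x[, vars, drop = FALSE]
}

d <- data.frame(a = 1)
d$`-b` <- 2
b <- "a"

# no column is called "b", so the symbol b resolves to the variable b ("a"),
# and applying unary minus to a string is an error
err <- tryCatch(select_cols(d, -b), error = function(e) conditionMessage(e))

# once a column called "b" exists, the very same call drops that column
d$b <- 3
kept <- names(select_cols(d, -b))
kept
```

under this model, whether b means 'the column b' or 'the variable b' is decided only by what happens to be in names(d) at the time of the call.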


>> b = "a"
>> `-`(b)
>>     
> Error in -b : invalid argument to unary operator
>
> `-` is a function remember.
>
> If you want this to work you can use get()
>
>   
>> subset(d, select = - get(b))
>>     
>   -b
> 1  2
>
>   

"use this hack to get around the design."
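
for comparison, a sketch of dropping the column named by b with plain indexing, no non-standard evaluation involved. this is not something proposed in the thread, just one way it could be written:

```r
d <- data.frame(a = 1)
d$`-b` <- 2
b <- "a"

# drop the column whose name is stored in b, using only ordinary evaluation
res <- d[, setdiff(names(d), b), drop = FALSE]

# the same selection with a logical index
res2 <- d[, names(d) != b, drop = FALSE]

names(res)
names(res2)
```

here b is always the variable b, whatever the columns of d are called.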

>> d$b = 3
>> subset(d, select = -b)
>> # catharsis
>>
>> (for whatever reason a user may choose to have a column named '-b')
>>     
>
> Yes, but the user is warned about not using standard naming conventions
> in the Introduction to R manual. You aren't stopped from using names
> like `-b` but if you use them, you have to expect to work a little
> harder.
>   

i'd like you to point me to that warning, as i apparently need to read
it, but i haven't found it in the manual yet.  thanks.

> Reading ?subset we have:
>
>   select: expression, indicating columns to select from a data frame.
>
> ....
>
>      For data frames, the 'subset' argument works on the rows.  Note
>      that 'subset' will be evaluated in the data frame, so columns can
>      be referred to (by name) as variables in the expression (see the
>      examples).
>
> which I think is reasonably explicit is it not? 

about?  it says nothing about how the expression passed as the select
argument is treated.  it just says that the select argument is an
expression indicating columns (but how?), and then, in the middle of
explaining the subset parameter, it mentions that columns can be
referred to by name as variables in the expression.  how clear is this?

the following does not work -- i'd expect it to, by virtue of the clear
explanation:

d = data.frame(a=1, b=2)
subset(d, select=c(a, "b"))
# what??  it does not break any 'specification' given in the docs


> It explains why your
> second example fails and why '- get(b)' doesn't, and also why your other
> examples don't give you what you want. You aren't using the appropriate
> 'name'.
>   
that's still too confusing.  ?get:

get(x, ...)

x: a variable name (given as a character string)

so:

get("b")
# "a", because we get the variable b, whose value is "a"

get(b)
# variable "a" not found

in '-get(b)', get(b) should evaluate to the value of the variable named
in b; b is "a", so get should look up the value of the variable a, but
there is none (unless you defined it), so this should break.  instead,
'get(b)' is replaced with 'a', and '-a' in subset(d, select=-a) is not
treated as an application of the function `-` to the variable a, but
literally as the specification 'all but the column named 'a''.

it must be painfully obvious to a casual user.
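
a sketch of the mechanics, assuming that select is evaluated against a list mapping column names to positions; the list nl below is built by hand for illustration:

```r
d <- data.frame(a = 1)
d$`-b` <- 2
b <- "a"

# column names mapped to their positions
nl <- list(a = 1L, `-b` = 2L)

# evaluated against that list, get(b) finds the entry "a" and yields position 1
pos <- eval(quote(get(b)), nl, environment())

# so -get(b) ends up as a negative index: drop column 1
res <- subset(d, select = -get(b))

pos
names(res)
```

so get(b) works inside subset only because 'a' happens to be a column of d.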


> I'm sure we could all find aspects of R that don't work in exactly the
> way we might preconceive or think of as being intuitive. 

most of it, seems like.

> But if it works
> as documented 

in many cases, the documentation is insufficient, confusing, and
unhelpful when it comes to what you might call 'optimizations' of this sort.

> then I don't see what the problem is unless i) you are
> offering to rewrite the code to make it "work better", ii) that R Core
> thinks any proposal "works better" and iii) in doing so it doesn't break
> most of the R code out there in R itself or in add-on packages.
>   

i'd prefer r to work better rather than "work better".  i'm afraid that
serious improvements to r must, by necessity, break quite a lot of
earlier code, which exploits, if only due to the impossibility of not doing
so, such design.

it certainly is a good idea to offer to contribute and i'd be happy to
do so, but i wouldn't be given a chance, i suppose.
besides, i try not to imagine what hides under the surface of a language
with such a design.

vQ


