[R] Variable passed to function not used in function in select=... in subset

Gavin Simpson gavin.simpson at ucl.ac.uk
Tue Nov 11 14:58:03 CET 2008


On Tue, 2008-11-11 at 11:08 +0100, Wacek Kusnierczyk wrote:
> Gavin Simpson wrote:
> >
> >> d = data.frame(a = 1)
> >> d$`-b` = 2
> >> names(d)
> >> # here we go
> >>
> >> subset(d, select = -b)
> >> # to b or not to b?
> >>     
> >
> > but -b is not the name of the column; you explicitly called it `-b` and
> > you should refer to it as such. If you use "non-standard" names then
> > expect to do a bit more work.
> >   
> identical(names(d)[2], "-b")
> 
> if i do
> 
> d$`c` = 4
> 
> then you claim d has no column named 'c'?

No, where do you get that from?

>   do i have to refer to the c
> column as `c`?

No, but then "c" is a name that doesn't need to be quoted. -b is a name
that needs to be quoted and if you quote it, things work as you might
expect.
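
As a minimal sketch of that point, rebuilding d as at the start of this thread and using nothing beyond base R: a syntactic name can be given bare or quoted, while the non-syntactic one needs backticks.

```r
d <- data.frame(a = 1)
d$`-b` <- 2                            # non-syntactic column name

bare    <- subset(d, select = a)       # syntactic name, no quoting needed
quoted  <- subset(d, select = `a`)     # quoting it is allowed but optional
minus_b <- subset(d, select = `-b`)    # backticks are needed for this name

names(minus_b)
```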

> 
> 
> >   
> >> subset(d, select = `-b`)
> >>     
> >   -b
> > 1  2
> >   
> 
> ... and i have to use
> 
> subset(d, select = `a`)
> 
> and not
> 
> subset(d, select = a)
> 
> right?

Is "a" a name in d? You can quote it if you want but it doesn't need to
be quoted, so you can use either.

>   besides, subset(d, select = `-b`) should rather return the
> column(s) whose names are the value of the variable `-b`:
> 
> `-a` = "a"
> subset(d, select = `-a`)
> # returns all columns except for the one named 'a', rather than the
> column named '-a' -- but that's just because there is no such column in
> d;  if there were, this one would be returned. 

No, it returns a if you are following on from your original examples.
`-a` refers to a variable (object) that evaluates to "a", and "a" is a
component of d, so that column is returned.
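
A small sketch of that lookup, assuming d is still the two-column data frame from the start of the thread (columns a and `-b`):

```r
d <- data.frame(a = 1)
d$`-b` <- 2
`-a` <- "a"          # an ordinary variable that happens to have a non-syntactic name

# d has no column called "-a", so the symbol is looked up outside d,
# where it evaluates to the string "a"; that string then selects column a.
res <- subset(d, select = `-a`)
names(res)
```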

> 
> so even with backquotes used, there is no obvious interpretation of what
> select=`-b`should mean, because it depends on what names components of
> the first argument have.  and this breaks the concept of referential
> transparency.
> 
> so the problem is not so easily explained away.  what subset does *is*
> messy.

In your opinion.

And without wanting to be rude or anything, your opinion carries very
little weight in a project like R. You've arrived on the list and been
very critical of the work of others. Now there is nothing wrong with
being critical if it is constructive, and additionally with something
like R you need to be constructive *and* contribute back. I'm not saying
that if you did patch R to work the way you think is correct R Core
would accept the patches, as they need to maintain backwards
compatibility, including with S, and not annoy the hundreds of package
authors. But coming on here and criticising the work of others isn't
going to win you many friends.

Also, subset (and the other things you've been harping on about) work as
documented. So you kind of have to like it or lump it.

> 
> 
> >> subset(d, select = - `-b`)
> >>     
> >   a
> > 1 1
> >
> >   
> >> b = "a"
> >> subset(d, select = -b)
> >> # tragedy
> >>     
> >
> > For this, I interpret it as not finding a column named b so tries to
> > evaluate:
> >
> >   
> 
> you interpret it.  how obvious is this for most users?
> it tries to find a column named 'b', not a column named b.  that's the
> problem with subset.

If users read the documentation then they'd know about unary operators.

> 
> 
> >> b = "a"
> >> `-`(b)
> >>     
> > Error in -b : invalid argument to unary operator
> >
> > `-` is a function remember.
> >
> > If you want this to work you can use get()
> >
> >   
> >> subset(d, select = - get(b))
> >>     
> >   -b
> > 1  2
> >
> >   
> 
> "use this hack to get around the design."

No hack, that is what get() is for. b is *not* a component of d. -b (or
`-`(b)) evaluates to an error. If you want to select columns except the
column referenced by the contents of b (which is "a") then you can use
get().
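
To make the failure concrete, here is a hedged sketch that captures both errors rather than printing them. The last line is an assumed alternative that was not proposed in this thread: it drops the column named in b by working on names(d) directly, so no evaluation inside d is involved.

```r
d <- data.frame(a = 1)
d$`-b` <- 2
b <- "a"

# Unary minus is not defined for character vectors, so this signals an error.
minus_err <- tryCatch(-b, error = function(e) e)

# subset() runs into the same error: b is not a column of d, so the
# variable b (the string "a") is what gets negated.
subset_err <- tryCatch(subset(d, select = -b), error = function(e) e)

# Assumed alternative: drop the column whose name is stored in b.
kept <- d[, setdiff(names(d), b), drop = FALSE]
names(kept)
```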

> 
> >> d$b = 3
> >> subset(d, select = -b)
> >> # catharsis
> >>
> >> (for whatever reason a user may choose to have a column named '-b')
> >>     
> >
> > Yes, but the user is warned about not using standard naming conventions
> > in the Introduction to R manual. You aren't stopped from using names
> > like `-b` but if you use them, you have to expect to work a little
> > harder.
> >   
> 
> i'd like you to point me to that warning, as i apparently need to read
> it, but i haven't found it in the manual yet.  thanks.

You could look at section 1.8 of An Introduction to R for a
starter. ?Syntax is also a logical place to start and it explicitly
refers you to details in the See Also section. If you read all of those
(but I'll save you some time and point you to ?Quotes) you find the
answers to how things like this work. ?Quotes explains what are
syntactic names and how to use '`' backticks to quote non-syntactic
names.
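
As a rough illustration of what "syntactic" means in practice (make.names() and the check.names argument are base R, but they are not mentioned elsewhere in this thread):

```r
# make.names() turns arbitrary strings into syntactic names.
make.names("a")     # already syntactic, returned unchanged
make.names("-b")    # not syntactic, so it is altered

# data.frame() applies the same conversion by default ...
checked   <- data.frame(`-b` = 2)
# ... unless it is told not to.
unchecked <- data.frame(`-b` = 2, check.names = FALSE)

names(checked)
names(unchecked)
```

This is why a column called `-b` only comes about when it is created deliberately, for example with d$`-b` = 2 or check.names = FALSE.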

Ok, ?Syntax and ?Quotes may not jump out at you as being very obvious
places to look. If so, grab the source to the Introduction to R manual,
find a logical place to put this information, or a note pointing people
to the help pages, and patch it accordingly. Then contribute that back
for the good of everyone.

> 
> > Reading ?subset we have:
> >
> >   select: expression, indicating columns to select from a data frame.
> >
> > ....
> >
> >      For data frames, the 'subset' argument works on the rows.  Note
> >      that 'subset' will be evaluated in the data frame, so columns can
> >      be referred to (by name) as variables in the expression (see the
> >      examples).
> >
> > which I think is reasonably explicit is it not? 
> 
> about?  it says nothing about how the expression passed as the select
> argument is treated.  it just says that the select argument is an
> expression indicating columns (but how?), and then, in the middle of
> explaining the subset parameter, it mentions that columns can be
> referred to by name as variables in the expression.  how clear is this?
> 
> the following does not work -- i'd expect it to, by virtue the clear
> explanation:
> 
> d = data.frame(a=1, b=2)
> subset(d, select=c(a, "b"))
> # what??  it does not break any 'specification' given in the docs

Where is a? What is it? And this is where you'll have to delve into the
code or read more of the manuals or, how about you stop being so
critical of people's work and ask the people who do know why this
doesn't work.
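
A rough sketch of why that call fails, under the assumption that subset() evaluates select in a list mapping each column name to its position (this mirrors how the base R source for subset.data.frame reads at present; nl and vars below are illustrative stand-ins, not the function itself):

```r
d <- data.frame(a = 1, b = 2)

# Stand-in for the internal lookup table: column names mapped to positions.
nl <- as.list(seq_along(d))
names(nl) <- names(d)

# a evaluates to the position 1, "b" stays a string, and c() coerces both
# to character, so the selection asks for columns named "1" and "b".
vars <- eval(quote(c(a, "b")), nl)

# There is no column named "1", so the indexing step signals an error.
res <- tryCatch(d[, vars, drop = FALSE], error = function(e) e)

# Using bare names consistently avoids the coercion.
ok <- subset(d, select = c(a, b))
```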

> 
> > It explains why your
> > second example fails and why '- get(b)' doesn't, and also why your other
> > examples don't give you what you want. You aren't using the appropriate
> > 'name'.
> >   
> that's still too confusing.  ?get:
> 
> get(x, ...)
> 
> x: a variable name (given as a character string)
> 
> so:
> 
> get("b")
> # "a", because we get the variable b, whose value is "a"
> 
> get(b)
> # variable "a" not found
> 
> in '-get(b)', get(b) should evaluate to the value of the variable named
> in b; b is "a", so get should lookup the value of the variable a, but
> there is none (unless you defined it), so this should break.  instead,
> 'get(b)' is replaced with 'a', and '-a' in subset(d, select=-a) is not
> treated as an application of the function `-`to the variable a, but
> literally as the specification 'but column named 'a''. 
> 
> it must be painfully obvious to a casual user.

But a *is* in d, and as a is evaluated there it doesn't break; that is
what the subset man page is telling you.

> 
> 
> > I'm sure we could all find aspects of R that don't work in exactly the
> > way we might preconceive or think of as being intuitive. 
> 
> most of it, seems like.

Again, IYHO. Please do not think that your opinion is the same as others
on this list or even R Core.

> 
> > But if it works
> > as documented 
> 
> in many cases, the documentation is insufficient, confusing, and
> unhelpful when it comes to this sort of what you might call 'optimizations'.

So where are your patches? You write a lot of critique on this list but
I haven't seen any public patches offered on R-Devel from you.

> 
> > then I don't see what the problem is unless i) you are
> > offering to rewrite the code to make it "work better", ii) that R Core
> > thinks any proposal "works better" and iii) in doing so it doesn't break
> > most of the R code out there in R itself or in add-on packages.
> >   
> 
> i'd prefer r to work better rather than "work better".  i'm afraid that
> serious improvements to r must, by necessity, break quite a lot of
> earlier code, which exploits, if only due the impossibility of not doing
> so, such design.

Again, IYHO. A lot of us have worked with/around infelicities of design
and don't want R breaking all of our code because it was "fixed". Now,
all the source is there and you could go and make the changes you want
and even release it if you see fit. No-one is stopping you.

> 
> it certainly is a good idea to offer to contribute and i'd be happy to
> do so, but i wouldn't be given a chance, i suppose.
> besides, i try not to imagine what hides under the surface of a language
> with such a design.

Have you tried? But bear in mind that R Core has more to balance than
just whether you think a design "flaw" or infelicity etc. should be fixed
when it decides whether to accept patches.

G

> 
> vQ
> 
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%