[R] Variable passed to function not used in function in select=... in subset

Tue Nov 11 16:41:12 CET 2008

> I think your analysis is correct, that the goals of casual use and
> programming are inconsistent.  But in general I think there's always going
> to be support for providing alternative ways that are programmer-safe.
>
> For instance, library(foo, character.only=TRUE) says that foo is a
> character vector, not the name of a package.  I don't know of anything that
> subset() provides that is not available in other ways (I think of it as
> purely a convenience function, and my first piece of advice to Karl was not
> to use it).

Good points - every function optimised for interactive use should have
a companion that is optimised for programmatic use.
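
To make the library() example concrete, here is a minimal sketch. The
variable name pkg and the choice of the tools package are mine, picked only
because tools ships with R but is not attached by default.

```r
# Interactive style: the argument is taken as the literal package name.
# library(tools)

# Programmatic style: the package name is held in a character variable.
pkg <- "tools"
library(pkg, character.only = TRUE)

# The package is now on the search path.
"package:tools" %in% search()
```

Without character.only = TRUE, library(pkg) would look for a package that is
literally called "pkg".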

> However, if there really is something there, then it would be
> worthwhile pointing that out, and either modifying subset() to make it safe,
> or providing an alternative function.

When I teach subsetting I try to make this clear - using [ will always
work, there's no magic and everything is explicit.  subset() has more
magic which saves you typing, but occasionally the magic doesn't work
and you'll be left scratching your head as to why.  In my experience
students prefer subset() until they encounter strange behaviour that
they don't understand.
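
Here is a small sketch of the kind of surprise I mean. The data frame and
the helper names pick_subset and pick_bracket are made up for illustration;
this is not Karl's original code.

```r
# A data frame that happens to have a column named like the function argument.
df <- data.frame(a = 1:3, b = 4:6, cols = 7:9)

# Convenience version: `select` is evaluated with the column names in scope
# first, so `cols` resolves to the column, not to the function argument.
pick_subset <- function(data, cols) subset(data, select = cols)

# Explicit version: `cols` can only mean the function argument.
pick_bracket <- function(data, cols) data[, cols, drop = FALSE]

names(pick_subset(df, "a"))   # "cols"
names(pick_bracket(df, "a"))  # "a"

# Drop the clashing column and the subset() version behaves as intended.
names(pick_subset(df[c("a", "b")], "a"))  # "a"
```

The same function gives different answers depending on which column names
the data happens to contain, which is exactly what makes it hard to debug.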

> I think this tension is a fundamental part of the character of S and R.  But
> it is also fundamental to R that there are QC tests that apply to code in
> packages:  so writing new tests that detect dangerous usage (e.g. to
> disallow partial name matching) would be another way to improve reliability.
>  Writing a test for misuse of drop=TRUE seems quite hard, but there are
> probably ways a debugger could do it:  e.g. to tag the invocation as to
> whether any indices were dropped on the first call, and then warn if the
> result isn't the same on every subsequent call.

A similar thing would be to force package authors to explicitly
specify na.rm to ensure that they have thought about how to deal with
missing values (this always trips me up).  Perhaps you could treat
drop similarly - in non-interactive code drop should not have a
default value.  Presumably this wouldn't be too hard to implement - R
CMD check would just switch out [ for a version that didn't have a
default value, in a similar way to what happens with T and F (another
example of implicit interactive use vs. explicit programmatic use).
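
A short sketch of the three defaults mentioned above. The function names
first_cols and first_cols_safe are hypothetical, used only to show the
difference.

```r
m <- matrix(1:6, nrow = 2)

# Relies on the default drop = TRUE: a matrix for k > 1, a bare vector for k == 1.
first_cols <- function(m, k) m[, seq_len(k)]

# States drop explicitly: always a matrix.
first_cols_safe <- function(m, k) m[, seq_len(k), drop = FALSE]

dim(first_cols(m, 2))       # 2 2
dim(first_cols(m, 1))       # NULL
dim(first_cols_safe(m, 1))  # 2 1

# na.rm defaults to FALSE, so a single missing value propagates.
x <- c(1, NA, 3)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2

# T and F are ordinary variables and can be reassigned; TRUE and FALSE cannot.
T <- 0
isTRUE(T)  # FALSE
rm(T)      # remove the masking variable again
```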

> Conceivably Karl's problem could be detected in the same way:  tag each name
> in the expression as to whether it was found in the data frame or some other
> environment, and then warn if that tag ever changes.  Or maybe the test
> should just warn that subset() is a convenience function, not meant for
> programming.

It would be nice if the documentation was clearer on these issues.  I
can imagine every function having a numeric value associated with it
which gave its position on the interactive vs programming continuum.
Then you could sum up all the values in a function and warn the author
if it was too high.  Not very practical to implement though!
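
Just to show the shape of the idea, a toy sketch follows. The scores in
interactive_score and the function score_function are entirely invented;
no such table exists in R, and a real check would need far more than a
name lookup.

```r
# Hypothetical scores: higher means more geared towards interactive use.
interactive_score <- c(subset = 3, attach = 3, with = 2, T = 1, F = 1)

# Sum the scores of every scored name that appears in the body of f.
score_function <- function(f) {
  used <- all.names(body(f))
  hits <- used[used %in% names(interactive_score)]
  sum(interactive_score[hits])
}

f <- function(d) subset(d, x > 1, select = y)
g <- function(d) d[d$x > 1, "y", drop = FALSE]

score_function(f)  # 3
score_function(g)  # 0
```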

Hadley
-- 
http://had.co.nz/