[R] Function to modify existing data.frame--Improving R Language

Sun Jan 30 21:01:37 CET 2005

On Wed, Jan 19, 2005 at 10:31:13AM -0800, Thomas Lumley wrote:
> On Wed, 19 Jan 2005, Peter Muhlberger wrote:
> 
> >By not allowing any straightforward passing by reference, R strikes me
> >as a lot less flexible & useful than it might be.  A basic operation in
> >other stats languages is to update a dataset using a program.  This
> >proves very helpful for managing data and setting up analyses.  But,
> >this seems to be quite inelegant to do in R.
> 
> I don't see why
>     mydata <- some.program(mydata)
> is much less elegant than
>     mydata.someProgram()
> as a way of updating a data set. It may use more memory, but that wasn't 
> the point at issue.

Somehow, this remark has stuck around in my head, and after a week
or so of pondering, I'd like to share two thoughts on this. These are
on a rather conceptual level, please skip this message if that's not
interesting to you.

(1) Regarding the net effect, these two are equivalent, but the notion
of "elegance" often differentiates between things which, with regard
to some given purpose, are equivalent. My interpretation of elegance in
this case is that it is "elegant" to have a bijection between instances
of objects in a program and entities in reality. This is realized by
"mydata <- some.program(mydata)" less than by "mydata.someProgram()",
assuming that the former operates on a copy of mydata and the latter on
a reference (i.e. "this" in Java).

The problem with "elegance" is, of course, that it is a rather subjective
criterion which may well turn out to be irrelevant as one tries to
explain it objectively and formalize it. But sometimes, a discontent
that initially manifests itself as a perception of "inelegance"
may point towards deeper reaching issues, and this may be the case
here. The principle of a one-to-one relation between entities and their
representations in a representation scheme is quite fundamental.
Normalization in relational database design, which aims to store each
fact in exactly one place, reflects this, for example.

(2) The information provided by the syntax
"mydata <- some.program(mydata)" is sufficient to allow an R interpreter
to recognize that a call by reference could be used here, and
interestingly, it seems to me that the lazy evaluation mechanism already
provides most of what is needed to pull this off.

As far as I understand, the promise generated when a variable is passed
as an argument resolves, upon being forced, to a sort of reference to
that variable's value. Unlike a normal variable, a promise cannot be
modified in place, e.g. by assignment to one of its elements. Therefore,
doing so triggers construction of a local variable, e.g. in

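    # Replace the 7th element of the argument and return the result.
    some.program <- function(x)
    {
      x[7] <- 4711  # modifies a local copy; the caller's variable is untouched
      x             # the last expression is the return value
    }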

the line "x[7] <- 4711" results in copying the value of the promise
x, replacing the 7th element of the copy with 4711, and assigning
the resulting vector to x, which thereby becomes a normal local
variable.
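
To make the copying behaviour described above concrete, here is a
minimal sketch. The names "mydata" and "result" are only illustrative,
and the example shows the observable semantics, not how the interpreter
implements them internally:

```r
some.program <- function(x) {
  x[7] <- 4711   # assignment to an element of the argument
  x
}

mydata <- 1:10
result <- some.program(mydata)

mydata[7]   # the caller's vector still holds its old 7th element
result[7]   # only the returned vector carries the new value

# The update becomes visible in the caller only after reassignment.
mydata <- some.program(mydata)
```

Without the final reassignment, the change made inside the function is
lost to the caller, which is exactly why the assignment form is needed.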

However, if the variable's value is going to be replaced by the
value returned by the function, this copying is not necessary. As
the old value is going to be scrapped anyway, the function can as
well use that as a local variable.

The R interpreter could actually exploit this "recycling potential":
Before actually executing a function call, such as some.program(mydata),
the interpreter would have to determine what the result is used for. If
the result is assigned to a variable, then the interpreter should check
whether the variable is also used as a parameter in the function call,
and if that is the case, the promise generated for the corresponding
parameter should be flagged "no copying necessary".
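
The check described above can be sketched at the R level on an
unevaluated call. This is only an illustration of the idea, not of how
the interpreter works: "can.recycle" is a hypothetical helper name, and
the sketch handles just the simple case of a "<-" assignment to a plain
variable name.

```r
# Hypothetical helper: TRUE if expr has the form  v <- f(..., v, ...),
# i.e. the assignment target also appears as a direct argument of the call.
can.recycle <- function(expr) {
  # Must be a call to the assignment operator `<-`.
  if (!is.call(expr) || !identical(expr[[1]], as.name("<-"))) return(FALSE)
  target <- expr[[2]]
  value  <- expr[[3]]
  # Target must be a plain variable name, right-hand side a function call.
  if (!is.name(target) || !is.call(value)) return(FALSE)
  args <- as.list(value)[-1]   # drop the function name, keep the arguments
  any(vapply(args, function(a) identical(a, target), logical(1)))
}

can.recycle(quote(mydata <- some.program(mydata)))   # target is an argument
can.recycle(quote(other <- some.program(mydata)))    # target is not an argument
```

A real implementation would also have to make sure that no other
variable still refers to the old value before reusing it.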

On a technical level, I have no idea how easy or difficult realizing
this idea would be, as I haven't delved into the R source and the docs
don't provide much detail about the internal workings of promises.
Furthermore, I would expect that the interpreter currently executes
function calls first and deals with using the results, e.g. by assigning
them to something, only after they are returned.

But perhaps the more interesting perspective on this is the conceptual
one, namely that the assignment variant is not per se less "elegant"
than the method call variant -- it all depends on how "elegantly" the
interpreter does it.

> Of course there are advantages to the ability to pass by reference, and 
> disadvantages -- the most obvious disadvantage is that it is not easy to 
> tell which variables are modified by a given piece of code.

Yes -- this is something I always disliked about C++, whereas in C,
the fact that a pointer is passed in is reasonably obvious from the call
itself, i.e. without consulting the function declaration or the docs. As
an idea, making the possible modification of argument variables obvious
at the call site in R could perhaps look something like

    mydata <- some.program(make.promise(mydata, allow.modification = TRUE))

> It probably wouldn't be that hard to produce something that looked like 
> a data frame but was passed by reference, by wrapping it in a 
> environment.

This reminds me of Python, where everything is a dictionary...
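
For what it's worth, the environment wrapping suggested in the quote
could be sketched roughly as follows. The names "make.ref" and
"add.total" and the field name "data" are made up for illustration;
this is a sketch of the idea, not an existing API:

```r
# Hypothetical constructor: wrap a data frame in an environment.
# Environments are not copied when passed to a function.
make.ref <- function(df) {
  ref <- new.env()
  ref$data <- df
  ref
}

# Hypothetical updater: adds a column to the wrapped data frame.
# The change is visible to the caller without any reassignment.
add.total <- function(ref) {
  ref$data$total <- ref$data$a + ref$data$b
  invisible(NULL)
}

mydata <- make.ref(data.frame(a = 1:3, b = 4:6))
add.total(mydata)      # no "mydata <- ..." needed
mydata$data$total      # the new column is there
```

The price is the one mentioned in the quote: nothing at the call site
tells the reader that add.total() modifies mydata.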

Best regards, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |    *NEW*    email: jtk at cmp.uea.ac.uk                               |
 |    *NEW*    WWW:   http://www.cmp.uea.ac.uk/people/jtk             |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*