[R] (structured) programming style

Thomas Lumley tlumley at u.washington.edu
Fri Sep 12 02:40:47 CEST 2003


On Thu, 11 Sep 2003, Ross Boylan wrote:

> I find that because R functions are call by value, and because there are
> no pointer or reference types (a la C++), I am making fairly heavy use
> of lexical scoping to modify variables.  E.g.
> outer <- function() {
>   m <- matrix(0, 2, 2)
>   inner <- function() {
>     m[2,2] <<- 3
>     ...
>   }
> }
>
> I am not too pleased with this, as it violates basic rules of structured
> programming, namely that it is not obvious what variables inner is
> reading or writing.  It's not as totally out of control as the use of
> global variables, but it's still bothersome.  In practice, I have many
> variables and several levels of nesting that come into play.
>
> A slightly subtler problem is that some of the variables in outer are
> just for use by outer, while others are used for communication down the
> line.  One can't tell by quick inspection what's what.
>
> I am trying to compensate by commenting the code heavily, but I'd rather
> not use a style that makes that necessary.
>
> I recognize that I could pass m as an argument to inner and return a
> modified version of it.  Assuming more than one variable was involved
> (as would usually be the case) I'd need to put the "new" m in a list
> returned from inner, and then unpack the list in the outer function.
> This is not only rather ugly, but I imagine it also raises some
> performance issues.
>


My personal preference is to use lexical scope only downwards (with a few
exceptions). Passing a variable into a function implicitly seems harmless,
but passing it out implicitly is, as you note, confusing.  In this context
I note that Python 2.1 has lexical closures that only allow variables to
be passed in and not out, and that most people seem happy with this.
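
To recast your example in that style (just a sketch): inner() still reads m
from the enclosing environment, but any change it makes is handed back
explicitly and re-bound by outer(), so outer() is the only thing that writes
its own variables.

outer <- function() {
  m <- matrix(0, 2, 2)
  inner <- function(value) {
    m[2, 2] <- value   # modifies a local copy of m only
    m                  # hand the modified matrix back
  }
  m <- inner(3)        # outer explicitly accepts the change
  m
}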

I would typically pass out a list, but often wouldn't bother to unpack it.
[One could make a case for multiple-value return to be added to R, but it's
never been a high enough priority for the effort it would take.]
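
For example (the function and component names here are just made up):

with_status <- function(m) {
  list(m = m + 1, converged = TRUE)
}
res <- with_status(matrix(0, 2, 2))
res$m           # index into the list where needed
res$converged   # rather than unpacking into separate variables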

There isn't necessarily a performance advantage to doing it your way (though
for matrices there probably will be one). Copying generally happens when an
object is modified, not when a name is bound.

For example, if you do

>  a<-list(m=rnorm(1e6))
> gc()
          used (Mb) gc trigger (Mb)
Ncells  369126  9.9     667722 17.9
Vcells 1087453  8.3    1471520 11.3
> b<-a$m
> gc()
          used (Mb) gc trigger (Mb)
Ncells  369103  9.9     667722 17.9
Vcells 1087396  8.3    1584906 12.1
> b[1]<-2
> gc()
          used (Mb) gc trigger (Mb)
Ncells  369108  9.9     667722 17.9
Vcells 2087397 16.0    2471462 18.9

you see that unpacking b from a didn't result in a copy, and that b must
just be a reference to a$m.  When b is modified it must be copied, but
this is true whether or not it is in a list. What matters is whether there
is another reference to it somewhere [actually, whether R thinks there
*might* be another reference: we try to be a bit conservative about this].
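
The same thing holds for function arguments, which is really your
call-by-value worry: passing a big object in and returning it unchanged
doesn't copy it by itself. A sketch (made-up function names):

just_read <- function(x) sum(x)              # no modification, no copy
touch_one <- function(x) { x[1] <- 0; x }    # modification forces the copy

m <- rnorm(1e6)
gc(); s  <- just_read(m); gc()   # Vcells essentially unchanged
gc(); m2 <- touch_one(m); gc()   # roughly 8Mb more: the copy happens here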


Now, it is certainly possible that you could have a situation where
assigning with <<- was really faster than passing back a list, by enough
to matter.  I think this situation is unusual enough that there may not be
a firm idea of `good R style', since it assumes that the objects are small
enough to fit easily in memory but large enough that it's worth going to
some effort to reduce copying. You might get more useful input from the
Bioconductor list, where people tend to spend a lot of time doing
computationally expensive things to medium-sized data sets.
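
If you do want to know whether it matters for your own objects, it's easy
enough to time both styles directly. A rough sketch (invented names and an
arbitrary size; substitute something like your real workload):

via_superassign <- function(n) {
  m <- matrix(0, n, n)
  step <- function(i) m[i, i] <<- i          # write upwards into outer's frame
  for (i in 1:n) step(i)
  m
}
via_return <- function(n) {
  m <- matrix(0, n, n)
  step <- function(m, i) { m[i, i] <- i; m } # pass in, hand back
  for (i in 1:n) m <- step(m, i)
  m
}
system.time(via_superassign(500))
system.time(via_return(500))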


	-thomas



