[Rd] Increase transparency: suggestion on how to avoid namespaces and/or unnecessary overwrites of existing functions

Sat Oct 1 18:11:27 CEST 2011

On Tue, Aug 23, 2011 at 2:23 PM, Janko Thyson
<janko.thyson.rstuff at googlemail.com> wrote:
> Dear list,
>
> I'm aware of the fact that I posted on something related a while ago, but I
> just can't sweat this off and would like to ask you for an opinion:
>
> The problem:
> Namespaces are great, but they don't resolve certain conflicts regarding
> name clashes. There are more and more people out there trying to come up
> with their own R packages, which is great also! Yet, it becomes more and
> more likely that programmers will choose identical names for their exported
> functions and/or that they add functionality to existing functions (i.e.
> overwriting existing functions).
> The whole process of which packages overwrite which functions is somewhat
> obscure and in addition depends on their order in the search path. On the
> other hand, it is not possible to use "namespace" functionality (i.e.
> 'namespace::fun()'; also less efficient than direct call; see illustration
> below) during early stages of the development process (i.e. the package is
> not finished yet) as there is no namespace available yet.
>
> I know of at least two cases where such overwrites (I think it's called
> masking, right?) led to some confusion at our chair:
> 1) loading package forecast overwrites certain functions in stats which made
> some code refactoring necessary
> 2) loading package 'R.utils' followed by package 'roxygen' overwrites
> 'parse.default()' which results in errors for something like
> 'eval(parse(text="a <- 1"))' ; see illustration below)
> And I'm sure the community could come up with lots more of such scenarios.
>
> Suggestions:
> 1) In order to avoid name clashes/unintended overwrites, how about switching
> to a coding paradigm that explicitly (and automatically) includes a
> package's name in all its functions' names once code is turned into a real
> package? E.g., getting used to "preemptively" type 'package_fun()' or
> 'package.fun()' instead of just 'fun()'. Better to be safe than sorry,
> right? This could be realized pretty easily (see example below) and, IMHO,
> would significantly increase transparency.
> 2) In order to avoid intended (but for the user often pretty obscure)
> overwrites of existing functions, we could use the same mechanism together
> with the "rule": just don't provide any functions that overwrite existing
> ones, rather prepend your version of that function with your package name
> and leave it up to the user which version he wants to call.

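The dependence on search-path order described above can at least be made
visible with base R's conflicts(). The following is only a minimal sketch:
'pkgA', 'pkgB' and 'frobnicate' are made-up names, and two attached lists
stand in for real packages.

```r
# Two hypothetical "packages" (plain lists) that export the same name.
pkgA <- list(frobnicate = function(x) "A")
pkgB <- list(frobnicate = function(x) "B")

# attach() inserts at position 2 by default, so the one attached last
# ends up in front of the other on the search path.
attach(pkgA, name = "pkgA", warn.conflicts = FALSE)
attach(pkgB, name = "pkgB", warn.conflicts = FALSE)

# List, per search-path entry, the names that are defined more than once.
masked <- conflicts(detail = TRUE)
print(masked[c("pkgA", "pkgB")])

# An unqualified call picks up the version that comes first on the path.
winner <- frobnicate(1)
print(winner)

# Clean up the search path again.
detach("pkgB", character.only = TRUE)
detach("pkgA", character.only = TRUE)
```

Swapping the two attach() calls changes which version an unqualified call
reaches, which is exactly the order dependence at issue.
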
Experts from the Lisp-Stats community have added a number
of functions to R that were inspired by Lisp, but one feature that apparently
was not added is the shadowing feature of Common Lisp. There, the default
behavior is not to permit packages to import conflicting names unless
explicit shadowing directives are specified.

Arguably packages are not intended to be used like callable libraries,
yet this is the way they are often used in the R context. This kind of
shadowing tool might help to make this practice safer, at the expense
of requiring the developer to specify explicit shadowing directives.
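
To make the idea concrete, here is a rough sketch of what such a check
could look like in plain R. It is not an existing R facility: the helper
'attach_strict', its 'shadow' argument and the toy package 'toypkg' are
all invented for illustration, and a named list stands in for a package's
exports.

```r
# Hypothetical helper: attach 'exports' (a named list of functions) under
# 'name', but refuse if any exported name is already visible on the search
# path, unless the caller explicitly lists that name in 'shadow'.
attach_strict <- function(exports, name, shadow = character()) {
  # Collect every name currently visible on the search path.
  visible <- unlist(
    lapply(search(), function(entry) ls(envir = as.environment(entry))),
    use.names = FALSE
  )
  clashes <- setdiff(intersect(names(exports), visible), shadow)
  if (length(clashes) > 0L) {
    stop("attaching '", name, "' would mask: ",
         paste(clashes, collapse = ", "),
         "; list these names in 'shadow' to allow it")
  }
  attach(exports, name = name, warn.conflicts = FALSE)
  invisible(name)
}

# A made-up package that redefines base::rev().
toypkg <- list(rev = function(x) "toypkg rev")

# Without a shadowing directive the attach is refused.
res <- try(attach_strict(toypkg, "toypkg"), silent = TRUE)
refused <- inherits(res, "try-error")

# With an explicit directive the masking is permitted.
attach_strict(toypkg, "toypkg", shadow = "rev")
shadowed <- rev(1:3)

# Clean up the search path again.
detach("toypkg", character.only = TRUE)
```

The point of the design is that masking becomes an opt-in decision
recorded in the code, rather than a side effect of load order.
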

Dominick

> At the moment, all of this is probably not that big of a deal yet, but my
> suggestion has more of a mid-term/long-term character.
>
> Below you find a little illustration. I'm probably asking too much, but it'd
> be great if we could get a little discussion going on how to improve the way
> of loading packages!
>
> Best regards and thanks for R and all its packages!
> Janko
>
> ################################################################################
> # PROOF OF CONCEPT
> ################################################################################
>
> # 1) PROBLEM
> # IMHO, with the number of packages submitted to CRAN constantly increasing,
> # over time we will be likely to see problems with respect to name clashes.
> # The main reasons I see for this are the following:
> # a) package developers picking identical names for their exported functions
> # b) package developers overwriting base functions in order to add functionality
> #    to existing functions
> # c) ...
> #
> # This can create scenarios in which the user might not exactly know that
> # he/she is using a 'modified' version of a specific function. More so, the user
> # needs to carefully read the description of each new package he plans
> # to use in order to find out which functions are exported and which existing
> # functions might be overwritten. This in turn might imply that the user's
> # existing code needs to be refactored (i.e. instead of using 'fun()' it
> # might now be necessary to type 'namespace::fun()' to be sure that the desired
> # function is called).
>
> # 2) SUGGESTED SOLUTION
> # That being said, why don't we switch to a 'preemptive' coding paradigm
> # where the default way of calling functions includes the specification of
> # its namespace? In principle, the functionality offered by 'namespace::fun()'
> # gets the job done.
> # BUT:
> # a) it is slower compared to the direct way of calling a function.
> #    (see illustration below).
> # b) this option is not available throughout the development process of a package
> #    as there is no namespace yet and there's no way to emulate one. This in
> #    turn means that even though a package developer would buy into strictly
> #    using 'mypkg::fun()' throughout his package code, he can only do so at the
> #    very final stage of the process RIGHT before turning his code into a
> #    working package (when he's absolutely sure everything is working as planned).
> #    For debugging he would need to go back to using 'fun()'. Pretty cumbersome.
>
> # So how about simply automatically prepending a given function's name with
> # the package's name for each package that is built (e.g. 'pkg.fun()' or
> # 'pkg_fun()')? In the end, this would just be a small change for new packages
> # without a significant decrease of performance and it could also be realized
> # at early stages of the development process (see illustration below).
>
> # 3) ILLUSTRATION
>
> # Example case where base function 'parse.default' is overwritten:
> parse(text="a <- 5")    # Works
> require(R.utils)
> require(roxygen)
> parse(text="a <- 5")    # Does not work anymore
>
> ################# START A NEW R SESSION BEFORE YOU CONTINUE ####################
>
> # Inefficiency of 'namespace::fun()':
> require(microbenchmark)
> res.a <- microbenchmark(eval(parse(text="a <- 5")))
> res.b <- microbenchmark(eval(base::parse(text="a <- 5")))
> median(res.a$time)/median(res.b$time)
>
> # Can be made up by explicit assignment:
> foo <- base::parse
> res.a <- microbenchmark(eval(parse(text="a <- 5")))
> res.b <- microbenchmark(eval(foo(text="a <- 5")))
> median(res.a$time)/median(res.b$time)
>
> # Automatically prepend function names:
> processNamespaces <- function(
>    do.global=FALSE,
>    do.verbose=FALSE,
>    .delim.name="_",
>    ...
> ){
>    srch.list.0 <- search()
>    srch.list <- gsub("package:", "", srch.list.0)
>    if(!do.global){
>        assign(".NS", new.env(), envir=.GlobalEnv)
>    }
>    out <- lapply(seq_along(srch.list), function(x.pkg){
>        pkg <- srch.list[x.pkg]
>
>        # SKIP LIST
>        if(pkg %in% c(".GlobalEnv", "Autoloads")){
>            return(NULL)
>        }
>        # /
>
>        # TARGET ENVIR
>        if(!do.global){
>            # ADD PACKAGE TO .NS ENVIRONMENT
>            envir <- eval(substitute(
>                assign(PKG, new.env(), envir=.NS),
>                list(PKG=pkg)
>            ))
>            # /
> #            envir <- get(pkg, envir=.NS, inherits=FALSE)
>            envir.msg <- paste(".NS$", pkg, sep="")
>        } else {
>            envir <- .GlobalEnv
>            envir.msg <- ".GlobalEnv"
>        }
>        # /
>
>        # PROCESS FUNCTIONS
>        cnt <- ls(pos=x.pkg)
>        out <- unlist(sapply(cnt, function(x.cnt){
>            value <- get(x.cnt, pos=x.pkg, inherits=FALSE)
>            obj.mod <- paste(pkg, x.cnt, sep=.delim.name)
>            if(!is.function(value)){
>                return(NULL)
>            }
>            if(do.verbose){
>                cat(paste("Assigning '", obj.mod, "' to '", envir.msg,
>                    "'", sep=""), sep="\n")
>            }
>            eval(substitute(
>                assign(OBJ.MOD, value, envir=ENVIR),
>                list(
>                    OBJ.MOD=obj.mod,
>                    ENVIR=envir
>                )
>            ))
>            return(obj.mod)
>        }))
>        names(out) <- NULL
>        # /
>        return(out)
>    })
>    names(out) <- srch.list
>    return(out)
> }
>
> # +++++
>
> funs <- processNamespaces(do.verbose=TRUE)
> ls(.NS)
> ls(.NS$base)
> .NS$base$base_parse
>
> res.a <- microbenchmark(eval(parse(text="a <- 5")))
> res.b <- microbenchmark(eval(.NS$base$base_parse(text="a <- 5")))
> median(res.a$time)/median(res.b$time)
>
> #+++++
>
> funs <- processNamespaces(do.global=TRUE, do.verbose=TRUE)
> base_parse
>
> res.a <- microbenchmark(eval(parse(text="a <- 5")))
> res.b <- microbenchmark(eval(base_parse(text="a <- 5")))
> median(res.a$time)/median(res.b$time)
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>