[Rd] Increase transparency: suggestion on how to avoid namespaces and/or unnecessary overwrites of existing functions

Simon Urbanek simon.urbanek at r-project.org
Sun Oct 2 01:45:18 CEST 2011


On Oct 1, 2011, at 6:14 PM, Duncan Murdoch wrote:

> On 11-10-01 5:14 PM, Dominick Samperi wrote:
>> On Sat, Oct 1, 2011 at 1:08 PM, Duncan Murdoch<murdoch.duncan at gmail.com>  wrote:
>>> On 11-08-23 2:23 PM, Janko Thyson wrote:
>>>> 
>>>> aDear list,
>>>> 
>>>> I'm aware of the fact that I posted on something related a while ago,
>>>> but I just can't sweat this off and would like to ask your for an opinion:
>>>> 
>>>> The problem:
>>>> Namespaces are great, but they don't resolve certain conflicts regarding
>>>> name clashes. There are more and more people out there trying to come up
>>>> with their own R packages, which is great also! Yet, it becomes more and
>>>> more likely that programmers will choose identical names for their
>>>> exported functions and/or that they add functionality to existing
>>>> function (i.e. overwriting existing functions).
>>>> The whole process of which packages overwrite which functions is
>>>> somewhat obscure and in addition depends on their order in the search
>>>> path. On the other hand, it is not possible to use "namespace"
>>>> functionality (i.e. 'namespace::fun()'; also less efficient than direct
>>>> call; see illustration below) during early stages of the development
>>>> process (i.e. the package is not finished yet) as there is no namespace
>>>> available yet.
>>>> 
>>> 
>>> I agree there can be a problem, but I don't think it is necessarily as
>>> serious as you suggest.  Even though there are more and more packages
>>> available, most people will still use roughly the same number of them. Just
>>> because CRAN has thousands of packages doesn't mean I use all of them at the
>>> same time.
>>> 
>>> 
>>>> I know of at least two cases where such overwrites (I think it's called
>>>> masking, right?) led to some confusion at our chair:
>>>> 1) loading package forecast overwrites certain functions in stats which
>>>> made some code refactoring necessary
>>> 
>>> If your code had been in a package with a NAMESPACE, it would not have been
>>> affected by a user loading forecast.  (If you start importing it, then of
>>> course it could cause masking problems.)
>>> 
>>> You suggest above that users only put code into a package very late in the
>>> development process.  The solution is, don't do that.  Create a package
>>> early on, and use it through the majority of development time.
>>> 
>>> You can leave the choice of exports until late by exporting everything;
>>> you'll still get the benefit of the more controlled name search from the
>>> beginning.
>>> 
>>> You say you can't use "namespace::call()" until the namespace package has
>>> been written.  But why would you want to?  If the call is coming from the
>>> new package, objects in it will be used with first priority in resolving the
>>> call.  You only need the :: notation when there are ambiguities in calls to
>>> external packages.
>>> 
>>> 
>>>> 2) loading package 'R.utils' followed by package 'roxygen' overwrites
>>>> 'parse.default()' which results in errors for something like
>>>> 'eval(parse(text="a<- 1"))' ; see illustration below)
>>>> And I'm sure the community could come up with lots more of such scenarios.
>>>> 
>>>> Suggestions:
>>>> 1) In order to avoid name clashes/unintended overwrites, how about
>>>> switching to a coding paradigm that explicitly (and automatically)
>>>> includes a package's name in all its functions' names once code is
>>>> turned into a real package? E.g., getting used to "preemptively" type
>>>> 'package_fun()' or 'package.fun()' instead of just 'fun()'. Better to be
>>>> save than sorry, right? This could be realized pretty easily (see
>>>> example below) and, IMHO, would significantly increase transparency.
>>> 
>>> I think long names with consistent prefixes are harder to read than short
>>> descriptive names.  I think this would make code harder to read. For
>>> example, the first few lines of mean.default would change from
>>> 
>>>    if (!is.numeric(x)&&  !is.complex(x)&&  !is.logical(x)) {
>>>        warning("argument is not numeric or logical: returning NA")
>>>        return(NA_real_)
>>>    }
>>> 
>>> to
>>> 
>>>    if (!base_is.numeric(x)&&  !base_is.complex(x)&&
>>>        !base_is.logical(x)) {
>>>        base_warning("argument is not numeric or logical: returning NA")
>>>        return(base_NA_real_)
>>>    }
>>> 
>>> 
>>>> 2) In order to avoid intended (but for the user often pretty obscure)
>>>> overwrites of existing functions, we could use the same mechanism
>>>> together with the "rule": just don't provide any functions that
>>>> overwrite existing ones, rather prepend your version of that function
>>>> with your package name and leave it up to the user which version he
>>>> wants to call.
>>> 
>>> That seems like good advice.
>>> 
>>> Duncan Murdoch
>> 
>> Except that namespace::foo should be assigned to another local
>> variable instead of using package::foo in a tight loop, because
>> repeated calls to "::" can introduce a significant performance
>> penalty. (This has been discussed in another thread.)
> 
> That's good advice too.
> 

Well, if this idiom is used more often, we may re-consider the implementation of `::` - it could definitely be sped up if needed, but I think so far the view was that it's not worth the hassle (aka increased complexity and relative ugliness of making it a special form) ...

Cheers,
Simon



> Duncan Murdoch
> 
>> 
>>>> 
>>>> At the moment, all of this is probably not that big of a deal yet, but
>>>> my suggestion has more of a mid-term/long-term character.
>>>> 
>>>> Below you find a little illustration. I'm probably asking too much, but
>>>> it'd be great if we could get a little discussion going on how to
>>>> improve the way of loading packages!
>>>> 
>>>> Best regards and thanks for R and all it's packages!
>>>> Janko
>>>> 
>>>> 
>>>> ################################################################################
>>>> # PROOF OF CONCEPT
>>>> 
>>>> ################################################################################
>>>> 
>>>> # 1) PROBLEM
>>>> # IMHO, with the number of packages submitted to CRAN constantly
>>>> increasing,
>>>> # over time we will be likely to see problems with respect to name
>>>> clashes.
>>>> # The main reasons I see for this are the following:
>>>> # a) package developers picking identical names for their exported
>>>> functions
>>>> # b) package developers overwriting base functions in order to add
>>>> functionality
>>>> #    to existing functions
>>>> # c) ...
>>>> #
>>>> # This can create scenarios in which the user might not exactly know that
>>>> # he/she is using a 'modified' version of a specific function. More so,
>>>> the user
>>>> # needs to carefully read the description of each new package he plans
>>>> # to use in order to find out which functions are exported and which
>>>> existing
>>>> # functions might be overwritten. This in turn might imply that the user's
>>>> # existing code needs to be refactored (i.e. instead of using 'fun()' it
>>>> # might now be necessary to type 'namespace::fun()' to be sure that the
>>>> desired
>>>> # function is called).
>>>> 
>>>> # 2) SUGGESTED SOLUTION
>>>> # That being said, why don't we switch to a 'preemptive' coding paradigm
>>>> # where the default way of calling functions includes the specification of
>>>> # its namespace? In principle, the functionality offered by
>>>> 'namespace::fun()'
>>>> # gets the job done.
>>>> # BUT:
>>>> # a) it is slower compared to the direct way of calling a function.
>>>> #    (see illustration below).
>>>> # b) this option is not available througout the development process of a
>>>> package
>>>> #    as there is no namespace yet and there's no way to emulate one.
>>>> This in
>>>> #    turn means that even though a package developer would buy into
>>>> strictly
>>>> #    using 'mypkg::fun()' throughout his package code, he can only do so
>>>> at the
>>>> #    very final stage of the process RIGHT before turning his code into a
>>>> #    working package (when he's absolutely sure everything is working as
>>>> planned).
>>>> #    For debugging he would need to go back to using 'fun()'. Pretty
>>>> cumbersome.
>>>> 
>>>> # So how about simply automatically prepending a given function's name
>>>> with
>>>> # the package's name for each package that is build (e.g. 'pkg.fun()' or
>>>> # 'pkg_fun()')? In the end, this would just be a small change for new
>>>> packages
>>>> # without a significant decrease of performance and it could also be
>>>> realized
>>>> # at early stages of the development process (see illustration below).
>>>> 
>>>> # 3) ILLUSTRATION
>>>> 
>>>> # Example case where base function 'parse.default' is overwritten:
>>>> parse(text="a<- 5")    # Works
>>>> require(R.utils)
>>>> require(roxygen)
>>>> parse(text="a<- 5")    # Does not work anymore
>>>> 
>>>> ################# START A NEW R SESSION BEFORE YOU CONTINUE
>>>> ####################
>>>> 
>>>> # Inefficiency of 'namespace::fun()':
>>>> require(microbenchmark)
>>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>>> res.b<- microbenchmark(eval(base::parse(text="a<- 5")))
>>>> median(res.a$time)/median(res.b$time)
>>>> 
>>>> # Can be made up by explicit assignment:
>>>> foo<- base::parse
>>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>>> res.b<- microbenchmark(eval(foo(text="a<- 5")))
>>>> median(res.a$time)/median(res.b$time)
>>>> 
>>>> # Automatically prepend function names:
>>>> processNamespaces<- function(
>>>>      do.global=FALSE,
>>>>      do.verbose=FALSE,
>>>>      .delim.name="_",
>>>>      ...
>>>> ){
>>>>      srch.list.0<- search()
>>>>      srch.list<- gsub("package:", "", srch.list.0)
>>>>      if(!do.global){
>>>>          assign(".NS", new.env(), envir=.GlobalEnv)
>>>>      }
>>>>      out<- lapply(1:length(srch.list), function(x.pkg){
>>>>          pkg<- srch.list[x.pkg]
>>>> 
>>>>          # SKIP LIST
>>>>          if(pkg %in% c(".GlobalEnv", "Autoloads")){
>>>>              return(NULL)
>>>>          }
>>>>          # /
>>>> 
>>>>          # TARGET ENVIR
>>>>          if(!do.global){
>>>>              # ADD PACKAGE TO .NS ENVIRONMENT
>>>>              envir<- eval(substitute(
>>>>                  assign(PKG, new.env(), envir=.NS),
>>>>                  list(PKG=pkg)
>>>>              ))
>>>>              # /
>>>> #            envir<- get(pkg, envir=.NS, inherits=FALSE)
>>>>              envir.msg<- paste(".NS$", pkg, sep="")
>>>>          } else {
>>>>              envir<- .GlobalEnv
>>>>              envir.msg<- ".GlobalEnv"
>>>>          }
>>>>          # /
>>>> 
>>>>          # PROCESS FUNCTIONS
>>>>          cnt<- ls(pos=x.pkg)
>>>>          out<- unlist(sapply(cnt, function(x.cnt){
>>>>              value<- get(x.cnt, pos=x.pkg, inherits=FALSE)
>>>>              obj.mod<- paste(pkg, x.cnt, sep=.delim.name)
>>>>              if(!is.function(value)){
>>>>                  return(NULL)
>>>>              }
>>>>              if(do.verbose){
>>>>                  cat(paste("Assigning '", obj.mod, "' to '", envir.msg,
>>>>                      "'", sep=""), sep="\n")
>>>>              }
>>>>              eval(substitute(
>>>>                  assign(OBJ.MOD, value, envir=ENVIR),
>>>>                  list(
>>>>                      OBJ.MOD=obj.mod,
>>>>                      ENVIR=envir
>>>>                  )
>>>>              ))
>>>>              return(obj.mod)
>>>>          }))
>>>>          names(out)<- NULL
>>>>          # /
>>>>          return(out)
>>>>      })
>>>>      names(out)<- srch.list
>>>>      return(out)
>>>> }
>>>> 
>>>> # +++++
>>>> 
>>>> funs<- processNamespaces(do.verbose=TRUE)
>>>> ls(.NS)
>>>> ls(.NS$base)
>>>> .NS$base$base_parse
>>>> 
>>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>>> res.b<- microbenchmark(eval(.NS$base$base_parse(text="a<- 5")))
>>>> median(res.a$time)/median(res.b$time)
>>>> 
>>>> #+++++
>>>> 
>>>> funs<- processNamespaces(do.global=TRUE, do.verbose=TRUE)
>>>> base_parse
>>>> 
>>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>>> res.b<- microbenchmark(eval(base_parse(text="a<- 5")))
>>>> median(res.a$time)/median(res.b$time)
>>>> 
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 



More information about the R-devel mailing list