[Rd] Inefficiency in df$col

Radford Neal radford at cs.toronto.edu
Sun Feb 3 18:04:55 CET 2019


While doing some performance testing with the new version of pqR (see
pqR-project.org), I've encountered an extreme, and quite unnecessary,
inefficiency in the current R Core implementation of R, which I think
you might want to correct.

The inefficiency is in access to columns of a data frame, in
expressions such as df$col[i], which I believe are very common (the
alternatives df[i,"col"] and df[["col"]][i] seem to be less common).
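For concreteness, here are the three access forms side by side, using a
throwaway data frame (the column name and values are just for illustration):

```r
df <- data.frame(col = 1:5)
i <- 2

df$col[i]         # $ extraction, then indexing -- the common idiom
df[i, "col"]      # matrix-style indexing by row and column name
df[["col"]][i]    # [[ extraction, then indexing

# All three return the same element, here 2L.
```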

Here is the setup for an example showing the issue:

  L <- list (abc=1:9, xyz=11:19)
  Lc <- L; class(Lc) <- "glub"
  df <- data.frame(L)

And here are some times for R-3.5.2 (r-devel of 2019-02-01 is much
the same):

  > system.time (for (i in 1:1000000) r <- L$xyz)
     user  system elapsed 
    0.086   0.004   0.089 
  > system.time (for (i in 1:1000000) r <- Lc$xyz)
     user  system elapsed 
    0.494   0.000   0.495 
  > system.time (for (i in 1:1000000) r <- df$xyz)
     user  system elapsed 
    3.425   0.000   3.426 

So accessing a column of a data frame is 38 times slower than
accessing a list element (which is what happens in the underlying
implementation of a data frame), and 7 times slower than accessing an
element of a list with a class attribute (for which it's necessary to
check whether there is a $.glub method, which there isn't here).
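To make that method-lookup cost concrete, here is a sketch of what would
happen if such a method *did* exist (the class name "glub" and this toy
method are purely illustrative, following the setup above):

```r
# Hypothetical S3 method for $ on the toy class "glub".
# When Lc$xyz is evaluated, dispatch finds this method and calls it,
# passing the field name as a character string.
`$.glub` <- function(x, name) {
  unclass(x)[[name]]   # fall back to plain list extraction
}

Lc <- structure(list(abc = 1:9, xyz = 11:19), class = "glub")
r <- Lc$xyz   # now goes through $.glub instead of the primitive fast path
```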

For comparison, here are the times for pqR-2019-01-25:

  > system.time (for (i in 1:1000000) r <- L$xyz)
     user  system elapsed 
    0.057   0.000   0.058 
  > system.time (for (i in 1:1000000) r <- Lc$xyz)
     user  system elapsed 
    0.251   0.000   0.251 
  > system.time (for (i in 1:1000000) r <- df$xyz)
     user  system elapsed 
    0.247   0.000   0.247 

So when accessing df$xyz, R-3.5.2 is 14 times slower than pqR-2019-01-25.
(For a partial match, like df$xy, R-3.5.2 is 34 times slower.)

I wasn't surprised that pqR was faster, but I didn't expect this big a
difference.  Then I remembered having seen a NEWS item from R-3.1.0:

  * Partial matching when using the $ operator _on data frames_ now
    throws a warning and may become defunct in the future. If partial
    matching is intended, replace foo$bar by foo[["bar", exact =
    FALSE]].
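For illustration, the replacement form suggested in the NEWS item behaves
like this on the example data frame from above:

```r
df <- data.frame(abc = 1:9, xyz = 11:19)

# Partial matching made explicit: "xy" matches the single column "xyz".
df[["xy", exact = FALSE]]

# With exact matching (the default for [[), a partial name finds nothing:
df[["xy"]]   # NULL
```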

and having looked at the code then:

  `$.data.frame` <- function(x,name) {
    a <- x[[name]]
    if (!is.null(a)) return(a)
  
    a <- x[[name, exact=FALSE]]
    if (!is.null(a)) warning("Name partially matched in data frame")
    return(a)
  }

I recall thinking at the time that this involved a pretty big
performance hit, compared to letting the primitive $ operator do it,
just to produce a warning.  But it wasn't until now that I noticed
this NEWS in R-3.1.1:

  * The warning when using partial matching with the $ operator on
    data frames is now only given when
    options("warnPartialMatchDollar") is TRUE.

for which the code was changed to:

  `$.data.frame` <- function(x,name) {
    a <- x[[name]]
    if (!is.null(a)) return(a)
  
    a <- x[[name, exact=FALSE]]
    if (!is.null(a) && getOption("warnPartialMatchDollar", default=FALSE)) {
          names <- names(x)
          warning(gettextf("Partial match of '%s' to '%s' in data frame",
                                     name, names[pmatch(name, names)]))
    }
    return(a)
  }

One can see the effect now when warnPartialMatchDollar is enabled:

  > options(warnPartialMatchDollar=TRUE)
  > Lc$xy
  [1] 11 12 13 14 15 16 17 18 19
  Warning message:
  In Lc$xy : partial match of 'xy' to 'xyz'
  > df$xy
  [1] 11 12 13 14 15 16 17 18 19
  Warning message:
  In `$.data.frame`(df, xy) : Partial match of 'xy' to 'xyz' in data frame

So the only thing that slowing down accesses like df$xyz by a factor of
seven achieves now is to add the words "in data frame" to the warning
message (while making the earlier part of the message less intelligible).

I think you might want to just delete the definition of $.data.frame,
reverting to the situation before R-3.1.0.

   Radford Neal
