[Rd] vctrs: a type system for the tidyverse

Hadley Wickham h@wickh@m @ending from gm@il@com
Mon Aug 6 18:21:09 CEST 2018


Hi all,

I wanted to share with you an experimental package that I’m currently
working on: vctrs, <https://github.com/r-lib/vctrs>. The motivation for
vctrs is to think deeply about the output “type” of functions like
`c()`, `ifelse()`, and `rbind()`, with an eye to implementing one
strategy throughout the tidyverse (i.e. all the functions listed at
<https://github.com/r-lib/vctrs#tidyverse-functions>). Because this is
going to be a big change, I thought it would be very useful to get
comments from a wide audience, so I’m reaching out to R-devel to get
your thoughts.

There is quite a lot already in the readme
(<https://github.com/r-lib/vctrs#vctrs>), so here I’ll try to motivate
vctrs as succinctly as possible by comparing `base::c()` to its
equivalent `vctrs::vec_c()`. I think the drawbacks of `c()` are well
known, but to refresh your memory, I’ve highlighted a few at
<https://github.com/r-lib/vctrs#compared-to-base-r>. I think they arise
because of two main challenges: `c()` has to both combine vectors *and*
strip attributes, and it only dispatches on the first argument.

The design of vctrs is largely driven by a pair of principles:

-   The type of `vec_c(x, y)` should be the same as `vec_c(y, x)`

-   The type of `vec_c(x, vec_c(y, z))` should be the same as
    `vec_c(vec_c(x, y), z)`

i.e. finding the common type should be associative and commutative. I
think these are good principles because they make types simpler to
understand and to implement.
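
As a rough illustration of what these two principles ask for, here is a minimal base R sketch. `toy_type2()` is a hypothetical helper, not part of vctrs: it assumes a tiny hierarchy (logical < integer < double), looks only at `typeof()`, and ignores attributes entirely.

```r
# Hypothetical toy helper (not the vctrs implementation): common type of two
# atomic vectors over the assumed hierarchy logical < integer < double.
toy_type2 <- function(x, y) {
  hierarchy <- c("logical", "integer", "double")
  rank_x <- match(typeof(x), hierarchy)
  rank_y <- match(typeof(y), hierarchy)
  if (is.na(rank_x) || is.na(rank_y)) {
    stop("No common type for ", typeof(x), " and ", typeof(y))
  }
  # The higher-ranked type wins; return it as a zero-length vector.
  vector(hierarchy[max(rank_x, rank_y)], 0L)
}

x <- TRUE
y <- 1:3
z <- 2.5

# Commutative: argument order does not matter.
commutative <- identical(toy_type2(x, y), toy_type2(y, x))

# Associative: grouping does not matter.
associative <- identical(
  toy_type2(x, toy_type2(y, z)),
  toy_type2(toy_type2(x, y), z)
)
```

Because the result depends only on the maximum rank of the inputs, both properties hold by construction in this toy case; richer types such as factors or data frames need more care.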

Method dispatch for `vec_c()` is quite simple because associativity and
commutativity mean that we can determine the output type by considering
only a pair of inputs at a time. To this end, vctrs provides
`vec_type2()`, which takes two inputs and returns their common type
(represented as a zero-length vector):

    str(vec_type2(integer(), double()))
    #>  num(0)

    str(vec_type2(factor("a"), factor("b")))
    #>  Factor w/ 2 levels "a","b":

    # NB: not all types have a common/unifying type
    str(vec_type2(Sys.Date(), factor("a")))
    #> Error: No common type for date and factor

(`vec_type2()` currently implements double dispatch through a
combination of S3 dispatch and if-else blocks, but this will change to a
pure S3 approach in the near future.)
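
One way double dispatch can be expressed with S3 alone is to dispatch on the first argument and then call `UseMethod()` again on the second. The sketch below is only an illustration of that technique under stated assumptions: `common2()` and all of its methods are hypothetical names, they cover only integer and double, and they are not what vctrs does.

```r
# Hypothetical generic (not part of vctrs): first dispatch on x.
common2 <- function(x, y) UseMethod("common2")

# Each first-level method dispatches a second time, now on y.
common2.integer <- function(x, y) UseMethod("common2.integer", y)
common2.double  <- function(x, y) UseMethod("common2.double", y)

# Second-level methods return the common type as a zero-length vector.
common2.integer.integer <- function(x, y) integer()
common2.integer.double  <- function(x, y) double()
common2.double.integer  <- function(x, y) double()
common2.double.double   <- function(x, y) double()

# Fallbacks: any other combination has no common type in this sketch.
no_common <- function(x, y) {
  stop("No common type for ", class(x)[1], " and ", class(y)[1])
}
common2.default         <- no_common
common2.integer.default <- no_common
common2.double.default  <- no_common

str(common2(1L, 2.5))
str(common2(2.5, 1L))
```

The cost of this design is one method per pair of types, which is the price of not hard-coding the combinations in if-else blocks.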

To find the common type of multiple vectors, we can use `Reduce()`:

    vecs <- list(TRUE, 1:10, 1.5)

    type <- Reduce(vec_type2, vecs)
    str(type)
    #>  num(0)

There’s one other piece of the puzzle: casting one vector to another
type. That’s implemented by `vec_cast()` (which also uses double
dispatch):

    str(lapply(vecs, vec_cast, to = type))
    #> List of 3
    #>  $ : num 1
    #>  $ : num [1:10] 1 2 3 4 5 6 7 8 9 10
    #>  $ : num 1.5

All up, this means that we can implement the essence of `vec_c()` in
only a few lines:

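    vec_c2 <- function(...) {
      args <- list(...)
      # Fold the pairwise common type over all inputs
      type <- Reduce(vec_type2, args)

      # Cast every input to the common type, then flatten
      cast <- lapply(args, vec_cast, to = type)
      unlist(cast, recursive = FALSE)
    }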

    vec_c(factor("a"), factor("b"))
    #> [1] a b
    #> Levels: a b

    vec_c(Sys.Date(), Sys.time())
    #> [1] "2018-08-06 00:00:00 CDT" "2018-08-06 11:20:32 CDT"

(The real implementation is a little more complex:
<https://github.com/r-lib/vctrs/blob/master/R/c.R>)

On top of this foundation, vctrs expands in a few different ways:

-   To consider the “type” of a data frame, and what the common type of
    two data frames should be. This leads to a natural implementation of
    `vec_rbind()` which includes all columns that appear in any input.

-   To create a new “list_of” type, a list where every element is of
    fixed type (enforced by `[<-`, `[[<-`, and `$<-`)

-   To think a little about the “shape” of a vector, and to consider
    recycling as part of the type system. (This thinking is not yet
    fully fleshed out.)
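
To make the first point concrete, here is a minimal base R sketch of a row-bind that keeps every column appearing in any input. `toy_rbind()` is a hypothetical helper written for this illustration, not the vctrs implementation; it fills absent columns with `NA` and leaves column type coercion to `rbind()`.

```r
# Hypothetical helper (not vec_rbind()): row-bind data frames, keeping the
# union of their columns and filling absent columns with NA.
toy_rbind <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))

  filled <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))
    }
    # Put the columns in one consistent order before binding.
    df[all_cols]
  })

  do.call(rbind, filled)
}

df1 <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(x = 3L, z = 2.5)

out <- toy_rbind(df1, df2)
names(out)
```

Here the output has the columns `x`, `y`, and `z`, with `NA` wherever an input lacked a column.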

Thanks for making it to the bottom of this long email :) I would love to
hear your thoughts on vctrs. It’s something that I’ve been having a lot
of fun exploring, and I’d like to make sure it is as robust as possible
(and the motivations are as clear as possible) before we start using it
in other packages.

Hadley


-- 
http://hadley.nz


