[Rd] vctrs: a type system for the tidyverse

Hadley Wickham h@wickh@m @ending from gm@il@com
Wed Aug 8 16:34:42 CEST 2018


>>> Method dispatch for `vec_c()` is quite simple because associativity and
>>> commutativity mean that we can determine the output type only by
>>> considering a pair of inputs at a time. To this end, vctrs provides
>>> `vec_type2()` which takes two inputs and returns their common type
>>> (represented as a zero-length vector):
>>>
>>>     str(vec_type2(integer(), double()))
>>>     #>  num(0)
>>>
>>>     str(vec_type2(factor("a"), factor("b")))
>>>     #>  Factor w/ 2 levels "a","b":
>>
>>
>> What is the reasoning behind taking the union of the levels here? I'm not
>> sure that is actually the behavior I would want if I have a vector of
>> factors and I try to append some new data to it. I might want/expect to
>> retain the existing levels and get either NAs or an error if the new data
>> has (present) levels not in the first data. The behavior as above doesn't
>> seem in line with what I understand the purpose of factors to be (explicit
>> restriction of possible values).
>
> Originally (like a week ago 😀), we threw an error if the factors
> didn't have the same levels, and provided an optional coercion to
> character. I decided that while this is correct (the factor levels are a
> parameter of the type, and hence factors with different levels aren't
> comparable), it fights too much against how people actually use
> factors in practice. It also seems like base R is moving more in this
> direction, i.e. in 3.4 factor("a") == factor("b") is an error, whereas
> in R 3.5 it returns FALSE.

I now have a better argument, I think:

If you squint your brain a little, I think you can see that each set
of automatic coercions is about increasing resolution. Integers are
low resolution versions of doubles, and dates are low resolution
versions of date-times. Logicals are low resolution versions of
integers because there's a strong convention that `TRUE` and `FALSE`
can be used interchangeably with `1` and `0`.
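
To make the "increasing resolution" idea concrete, here is a small
illustration using only base R's `c()` and `typeof()`. It does not call
vctrs at all, so treat it as a sketch of the ordering logical < integer <
double rather than a statement about what the package does:

```r
# Combining with base c() promotes to the higher-resolution type.
lgl_int <- c(TRUE, 1L)     # logical + integer
int_dbl <- c(1L, 2.5)      # integer + double
lgl_dbl <- c(FALSE, 0.5)   # logical + double

typeof(lgl_int)  #> "integer"
typeof(int_dbl)  #> "double"
typeof(lgl_dbl)  #> "double"

# Going up in resolution loses nothing: the values survive a round trip.
round_trip <- as.logical(as.integer(c(TRUE, FALSE)))
identical(round_trip, c(TRUE, FALSE))  #> TRUE
```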

But what is the resolution of a factor? We must take a somewhat
pragmatic approach because base R often converts character vectors to
factors, and we don't want to be burdensome to users. So we say that a
factor `x` has finer resolution than factor `y` if the levels of `y`
are contained in the levels of `x`. So to find the common type of two
factors, we take the union of the levels of each factor, giving a factor
that has finer resolution than both. Finally, you can think of a character
vector as a factor with every possible level, so factors and character
vectors are coercible.
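
The rule above can be sketched in base R. The two helpers below,
`finer_or_equal()` and `factor_common_type()`, are hypothetical names made
up for this illustration; they are not part of vctrs and are not how
`vec_type2()` is implemented:

```r
# Hypothetical helpers, not part of vctrs: a base R sketch of the rule above.

# TRUE if factor `x` has resolution at least as fine as factor `y`,
# i.e. every level of `y` is also a level of `x`.
finer_or_equal <- function(x, y) {
  all(levels(y) %in% levels(x))
}

# Common type of two factors: a zero-length factor whose levels are the
# union of the levels of both inputs.
factor_common_type <- function(x, y) {
  factor(character(), levels = union(levels(x), levels(y)))
}

x <- factor("a")
y <- factor("b")
common <- factor_common_type(x, y)

levels(common)             #> "a" "b"
finer_or_equal(common, x)  #> TRUE
finer_or_equal(common, y)  #> TRUE
finer_or_equal(x, y)       #> FALSE
```

Under this sketch, neither `x` nor `y` is finer than the other, but their
common type is at least as fine as both, which is the property the union
of levels is meant to guarantee.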

(extracted from the in-progress vignette explaining how to extend
vctrs to work with your own vectors, now that vctrs has been rewritten
to use double dispatch)

Hadley

-- 
http://hadley.nz


