[Rd] vctrs: a type system for the tidyverse

Iñaki Úcar i@uc@r86 @ending from gm@il@com
Wed Aug 8 19:47:38 CEST 2018


On Wed, Aug 8, 2018 at 19:23, Gabe Becker (<becker.gabe using gene.com>) wrote:
>
> Actually, I sent that too quickly, I should have let it stew a bit more.
> I've changed my mind about the resolution argument I was trying to make.
> There is more information, technically speaking, in the factor with empty
> levels. I'm still not convinced that it's the right behavior, personally. It
> may just be me though, since Martin seems on board. Mostly I'm just very
> wary of taking away the thing about factors that makes them fundamentally
> not characters, and removing the effectiveness of the level restriction, in
> practice, does that.

For what it's worth, I always thought about factors as fundamentally
characters, but with restrictions: a subspace of all possible strings.
And I'd say that a non-negligible number of R users may think about
them in a similar way.
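
A minimal base R sketch of that view, with made-up variable names: a
factor accepts only strings from its level set, and a value outside
that set is stored as NA (with a warning) rather than extending the
levels.

```r
# A factor whose values are restricted to three levels.
size <- factor(c("small", "large"), levels = c("small", "medium", "large"))

# Assigning a string inside the level set works as for a character vector.
size[1] <- "medium"

# Assigning a string outside the level set is rejected: R emits an
# "invalid factor level, NA generated" warning and stores NA instead.
size[2] <- "huge"

as.character(size)  # second element is now NA
levels(size)        # the level set itself is unchanged
```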

In fact, if you search "concatenation factors", you'll see that back
in 2008 somebody asked on R-help [1] because he wanted to do exactly
what Hadley is describing (i.e., concatenation as character with
levels as a union of the levels), and he was surprised because...
well, the behaviour of c.factor is quite surprising if you don't read
the manual.

BTW, the solution proposed was unlist(list(fct1, fct2)).
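
A minimal base R sketch of that workaround, next to a more explicit
spelling of the same idea. The inputs `fct1` and `fct2` are made up
for the example, and `combine_factors()` is a hypothetical helper,
not a function from base R or vctrs.

```r
fct1 <- factor(c("a", "b"))
fct2 <- factor(c("b", "c"))

# The workaround from the R-help thread: unlist() treats a list of
# factors specially and returns a factor over the union of the levels.
combined <- unlist(list(fct1, fct2))
levels(combined)
as.character(combined)

# Hypothetical helper spelling out the same idea: concatenate the
# values as character, then restrict them to the union of both level sets.
combine_factors <- function(x, y) {
  factor(c(as.character(x), as.character(y)),
         levels = union(levels(x), levels(y)))
}

levels(combine_factors(fct1, fct2))
```

The helper makes the two steps explicit (concatenate as character,
take the union of the levels), which is the behaviour described
above for concatenating factors.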

[1] https://www.mail-archive.com/r-help@r-project.org/msg38360.html

Iñaki

>
> Best,
> ~G
>
> On Wed, Aug 8, 2018 at 8:54 AM, Martin Maechler <maechler using stat.math.ethz.ch>
> wrote:
>
> > >>>>> Hadley Wickham
> > >>>>>     on Wed, 8 Aug 2018 09:34:42 -0500 writes:
> >
> >     >>>> Method dispatch for `vec_c()` is quite simple because
> >     >>>> associativity and commutativity mean that we can
> >     >>>> determine the output type only by considering a pair of
> >     >>>> inputs at a time. To this end, vctrs provides
> >     >>>> `vec_type2()` which takes two inputs and returns their
> >     >>>> common type (represented as zero length vector):
> >     >>>>
> >     >>>> str(vec_type2(integer(), double()))
> >     >>>> #> num(0)
> >     >>>>
> >     >>>> str(vec_type2(factor("a"), factor("b")))
> >     >>>> #> Factor w/ 2 levels "a","b":
> >     >>>
> >     >>>
> >     >>> What is the reasoning behind taking the union of the
> >     >>> levels here? I'm not sure that is actually the behavior
> >     >>> I would want if I have a vector of factors and I try to
> >     >>> append some new data to it. I might want/ expect to
> >     >>> retain the existing levels and get either NAs or an
> >     >>> error if the new data has (present) levels not in the
> >     >>> first data. The behavior as above doesn't seem in-line
> >     >>> with what I understand the purpose of factors to be
> >     >>> (explicit restriction of possible values).
> >     >>
> >     >> Originally (like a week ago), we threw an error if the
> >     >> factors didn't have the same levels, and provided an
> >     >> optional coercion to character. I decided that while
> >     >> correct (the factor levels are a parameter of the type,
> >     >> and hence factors with different levels aren't
> >     >> comparable), that this fights too much against how people
> >     >> actually use factors in practice. It also seems like base
> >     >> R is moving more in this direction, i.e. in 3.4
> >     >> factor("a") == factor("b") is an error, whereas in R 3.5
> >     >> it returns FALSE.
> >
> >     > I now have a better argument, I think:
> >
> >     > If you squint your brain a little, I think you can see
> >     > that each set of automatic coercions is about increasing
> >     > resolution. Integers are low resolution versions of
> >     > doubles, and dates are low resolution versions of
> >     > date-times. Logicals are low resolution versions of
> >     > integers because there's a strong convention that `TRUE`
> >     > and `FALSE` can be used interchangeably with `1` and `0`.
> >
> >     > But what is the resolution of a factor? We must take a
> >     > somewhat pragmatic approach because base R often converts
> >     > character vectors to factors, and we don't want to be
> >     > burdensome to users. So we say that a factor `x` has finer
> >     > resolution than factor `y` if the levels of `y` are
> >     > contained in `x`. So to find the common type of two
> >     > factors, we take the union of the levels of each factor,
> >     > giving a factor that has finer resolution than
> >     > both. Finally, you can think of a character vector as a
> >     > factor with every possible level, so factors and character
> >     > vectors are coercible.
> >
> >     > (extracted from the in-progress vignette explaining how to
> >     > extend vctrs to work with your own vctrs, now that vctrs
> >     > has been rewritten to use double dispatch)
> >
> > I like this argumentation, and find it very nice indeed!
> > It confirms my own gut feeling which had led me to agreeing
> > with you, Hadley, that taking the union of all factor levels
> > should be done here.
> >
> > As Gabe mentioned (and you've explained about) the term "type"
> > is really confusing here.  As you know, the R internals are all
> > about SEXPs, TYPEOF(), etc, and that's what the R level
> > typeof(.) also returns.  As you want to use something slightly
> > different, it should be different naming, ideally something not
> > existing yet in the R / S world, maybe 'kind' ?
> >
> > Martin
> >
> >
> >     > Hadley
> >
> >     > --
> >     > http://hadley.nz
> >
> >     > ______________________________________________
> >     > R-devel using r-project.org mailing list
> >     > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
>
>
> --
> Gabriel Becker, Ph.D
> Scientist
> Bioinformatics and Computational Biology
> Genentech Research
>