[Rd] Function 'factor' issues

Wed Oct 18 19:59:59 CEST 2017

Martin, Suharto, et al.,

On Wed, Oct 18, 2017 at 9:54 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

> *<snip>*
>
>     > Note: In theory, if function 'factor' merged duplicated 'labels' in
>     > all cases, at least in
>     > factor(c(sqrt(2)^2, 2))  ,
>     > function 'factor' could do matching on original 'x' (without
>     > conversion to character), as in R before version 2.10.0. If function
>     > 'factor' did it,
>     > factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels = c("sqrt(2)^2", "2"))
>     > could take sqrt(2)^2 and 2 as distinct.
>
> Well, that may be interesting.. but I doubt if that's somewhere
> we should go, easily, because  factor() has been documented to do
> what it does now (with very slightly rounding such numbers via
> as.character(.))
> and hence such a change would typically lead to much work for
> too many people.
>
> I do see that indeed the  as.character(.) inside factor() takes
> most of the CPU time used in largish factor() examples [as your
> first], and indeed, for the case of integer 'x', we really could
> be much faster in factor construction.
>

Indeed; the ALTREP framework already has an alternative string (character
vector) implementation which defers conversion from another vector type.
Luke implemented it to drastically speed up the creation of default row
labels (1:n, I believe) on design matrices within lm/glm, under the
assumption that most of the time no one looks at the design matrix row
labels, so there's no reason to pay the cost of creating them until
someone does.

In principle, we could look at doing the same for levels of a factor.
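
As a rough illustration of why integer input is the easy case, here is a
minimal sketch. It does not use ALTREP at all, and fast_int_factor() is a
hypothetical helper of mine, not anything in base R. It only shows that the
codes can be computed by matching on the integers directly, so the character
conversion is limited to the (usually much shorter) vector of unique values.

```r
# Hypothetical helper (not part of base R): build a factor from an integer
# vector without converting every element of 'x' to character.
fast_int_factor <- function(x) {
  stopifnot(is.integer(x))
  lev <- sort(unique(x))        # sort() drops NA, as factor() does by default
  codes <- match(x, lev)        # match on the integers themselves
  # Only the unique values are converted to character for the levels.
  structure(codes, levels = as.character(lev), class = "factor")
}

x <- c(3L, 1L, 2L, 3L, NA, 1L)
same_as_factor <- identical(fast_int_factor(x), factor(x))
```

A deferred (ALTREP-style) levels vector would go one step further and postpone
even the as.character(lev) call until someone actually asks for the levels.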

Furthermore, while I haven't yet put in the hooks to utilize them, even
in my local copy, ALTREP classes are allowed to include a custom unique
method, so in cases where

unique(as.character(x)) == as.character(unique(x))

we could avoid the conversion even when calling unique() (at the R level)
on such a deferred vector.

This would include, I think, the case where levels are inferred from the
vector. I believe the point Martin made earlier about as.character
potentially doing some rounding means that the required identity would hold
when factors are being generated from integer vectors but would not be
guaranteed when factors are generated from (non-integer) numeric vectors.
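
To make the distinction concrete, here is a small check of that identity for
one integer vector and one double vector. This is only a sketch with example
values I picked (the double case reuses the sqrt(2)^2 example from earlier in
the thread); it demonstrates the behaviour for these inputs and is not a proof
for the integer case in general.

```r
# Integer input: converting then deduplicating gives the same result as
# deduplicating then converting.
xi <- c(3L, 1L, 3L, 2L, 1L)
int_ok <- identical(unique(as.character(xi)), as.character(unique(xi)))

# Double input: sqrt(2)^2 and 2 are distinct doubles, but as.character()
# keeps 15 significant digits, so both become the same string.
xd <- c(sqrt(2)^2, 2)
convert_first <- unique(as.character(xd))   # duplicates merged after conversion
unique_first  <- as.character(unique(xd))   # both doubles survive unique()
dbl_ok <- identical(convert_first, unique_first)
```

Here int_ok is TRUE while dbl_ok is FALSE, because convert_first has one
element and unique_first has two.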

Best,
~G

>
>     > Another thing: Function 'factor' in R devel uses 'order' instead of 'sort.list'.
>
> This has been a change on purpose --- well documented as new
> feature in NEWS --- to allow using *methods* for order(),
> i.e. for the workhorse of order, xtfrm()  so that factor(OB)
> works for more general objects OB.
>
>
>     > The case of as.factor(x) for
>     > x <- as.data.frame(character(0))
>     > in tests/isas-tests.Rout.save reveals that 'order' on data frame is strange.
>
>     > x <- as.data.frame(character(0))
>     > y <- unique(x)
>     > length(y)  # 1
>     > length(order(y))  # 0
>     > length(as.character(y))  # 1
>
>     > order(y) is not as long as as.character(y).
>
>     > Another example:
>     > length(mtcars)  # 11
>     > length(order(mtcars))  # 352
>
> I agree that  order(<data.frame>) may look a bit strange;
> I've spent more than an hour looking into it, and making it
> [actually,  rank(<data.frame>,..) ]
> an error, but ended up finding much evidence that there's too
> much related code, sometimes even in base R which assumes that a
> numeric data frame behaves the same as a numeric matrix.
>
> And also, if you carefully read the help files, of
>   order(),
>   xtfrm(),
>   rank()
>
> there's always mentioned that these work for R object 'x'
> basically as long as   x[!is.na(x)]   returns a "nice"
> (typically atomic) vector .. which is the case for such data frames.
>
> The consequence, that  in R-devel, currently
>
>     factor(mtcars)
>
> just "works",  is indeed unexpected or even "shocking", and I
> still don't know what the most elegant and reasonable way would
> be to make this an error -- as it used to be when  sort.list()
> was used instead of order().  I'd find it ugly (and even more
> time consuming!) if factor() itself would have to check its
> argument and signal an error for a data.frame.
>
> The relevant call tree is
>
>   factor() -> order() -> xtfrm() -> xtfrm.default() -> rank()
>
> and as I said, rank(x,*) works when  x[!is.na(x)]  is an atomic
> "numeric-like" vector  which is the case for a numeric data
> frame such as 'mtcars'.
>
> Martin Maechler
> ETH Zurich

-- 
Gabriel Becker, PhD
Scientist (Bioinformatics)
Genentech Research
