[Rd] Function 'factor' issues

Suharto Anggono suharto_anggono at yahoo.com
Sun Oct 15 18:03:48 CEST 2017


In R devel, function 'factor' has been changed so that duplicated 'labels' are allowed and are merged into a single level.

Issue 1: When 'labels' is specified and has no duplicates, function 'factor' is slower than before.
Example:
x <- rep(1:26, 40000)
system.time(factor(x, levels=1:26, labels=letters))

Function 'factor' is already rather slow because of the conversion to character. Please don't add a further slowdown.
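A rough way to see the cost of the 'labels' path is to time it against building the factor without 'labels' and relabelling afterwards. This is only a sketch: the variable names are mine, timings depend on the machine and the R version, and no particular ratio is implied.

```r
x <- rep(1:26, 40000)

# Path 1: let factor() handle 'labels' itself.
t_labels <- system.time(f1 <- factor(x, levels = 1:26, labels = letters))

# Path 2: build the factor first, then relabel the levels.
t_relabel <- system.time({
  f2 <- factor(x, levels = 1:26)
  levels(f2) <- letters
})

# Elapsed seconds for each path; both give the same codes and levels.
print(c(with_labels = t_labels[["elapsed"]],
        relabel_after = t_relabel[["elapsed"]]))
```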

Issue 2: Although 'labels' defaults to 'levels', omitting 'labels' can give a different result from specifying 'labels' to be the same as 'levels'.

Example 1:
as.integer(factor(c(NA,2,3), levels = c(2, NA), exclude = NULL))
is different from
as.integer(factor(c(NA,2,3), levels = c(2, NA), labels = c(2, NA), exclude = NULL))

File reg-tests-1d.R indicates that the behavior of 'factor' with NA has changed slightly, for the better. An NA entry (NA because it is unmatched to the 'levels' argument or is in 'exclude') is absorbed into the NA in the "levels" attribute (which comes from the 'labels' argument), if there is one. The issue is that this happens only when 'labels' is specified.

Function 'factor' could use match(xlevs, nlevs)[f] instead. That form doesn't match an NA entry to the NA level. When 'f' is long enough, longer than 'xlevs', it is also faster than match(xlevs[f], nlevs).
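The difference between the two forms can be seen with base 'match' alone. In this sketch, 'xlevs', 'nlevs' and 'f' are filled in by hand with the values they would have in Example 1 (labels c(2, NA), and codes from matching c(NA, 2, 3) against levels c(2, NA)); they are my reconstruction, not output taken from 'factor' itself.

```r
xlevs <- c("2", NA)       # as.character(labels)
nlevs <- unique(xlevs)    # merged labels; here the same as xlevs
f <- c(2L, 1L, NA)        # codes: NA -> level 2, 2 -> level 1, 3 unmatched

# Look up each label once, then index by code: unmatched stays NA.
a <- match(xlevs, nlevs)[f]
print(a)   # 2 1 NA

# Expand to one label per element, then match: the unmatched entry
# becomes an NA string, which matches the NA level.
b <- match(xlevs[f], nlevs)
print(b)   # 2 1 2
```

The first form calls 'match' on length(xlevs) elements instead of length(f) elements, which is where the speed difference for long 'f' would come from.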

Example 2:
With
levs <- c("A","A")
the call
factor(levs, levels=levs)
gives an error, but
factor(levs, levels=levs, labels=levs)
doesn't.

Note: In theory, if function 'factor' merged duplicated 'labels' in all cases, at least in a call like
factor(c(sqrt(2)^2, 2))  ,
function 'factor' could do the matching on the original 'x' (without conversion to character), as in R before version 2.10.0. If function 'factor' did that,
factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels = c("sqrt(2)^2", "2"))
could treat sqrt(2)^2 and 2 as distinct.
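To illustrate why the conversion to character matters here, this sketch compares matching on the numeric values with matching on their character form, using only base 'match'. It does not call 'factor'. The character form depends on how as.character formats doubles in the running R version, so no output is claimed for that line.

```r
x <- c(sqrt(2)^2, 2)

# The two values differ in floating point.
print(x[1] == x[2])          # FALSE

# Matching on the original numbers keeps them apart.
num_codes <- match(x, x)
print(num_codes)             # 1 2

# Matching after conversion to character may merge them, because
# both values can be formatted as the same string.
chr_codes <- match(as.character(x), as.character(x))
print(chr_codes)
```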


Another thing: Function 'factor' in R devel uses 'order' instead of 'sort.list'.

The case of as.factor(x) for
x <- as.data.frame(character(0))
in tests/isas-tests.Rout.save reveals that 'order' behaves strangely on a data frame.

x <- as.data.frame(character(0))
y <- unique(x)
length(y)  # 1
length(order(y))  # 0
length(as.character(y))  # 1

order(y) is not as long as as.character(y).

Another example:
length(mtcars)  # 11
length(order(mtcars))  # 352
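For contrast, here is a sketch of ordering a data frame by rows, passing its columns to 'order' as separate sort keys. This is a common idiom and not something 'factor' does. The 'unname' call is my precaution so that no column name is taken for a named argument of 'order'.

```r
# One sort key per column, so the result has one entry per row.
ord <- do.call(order, unname(as.list(mtcars)))

print(length(ord))     # 32, the number of rows
print(nrow(mtcars))    # 32

# The first column (mpg) is the primary key, so it comes out sorted.
print(head(mtcars$mpg[ord]))
```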
