# [Rd] duplicated factor labels.

Paul Johnson pauljohn32 at gmail.com
Thu Jun 15 02:00:11 CEST 2017

```
Dear R devel

time, but can one of you help me understand this?

This concerns duplicated labels, not levels, in the factor function.

I find it hard to understand why factor() fails with duplicated labels,
but assigning the same labels with levels() afterwards does not:

> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
Error in `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels,  :
factor level  is duplicated
> y <- factor(x, levels = xlevels)
> levels(y) <- xlabels
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4

If the latter use of levels() gives a good, expected result, couldn't
factor(..., labels = xlabels) be made to do the same thing?

That's the gist of it. To signal to you that I've been trying to
figure this out on my own, here is a revision I've tested in R's
factor function which "seems" to fix the matter. (Of course, it probably
causes lots of other problems I don't understand; that's why I'm
writing to you now.)

In the factor function, the class of f is assigned *after* levels(f) is called

    levels(f) <- ## nl == nL or 1
        if (nl == nL) as.character(labels)
        else paste0(labels, seq_along(levels))
    class(f) <- c(if(ordered) "ordered", "factor")

At that point, f is still a plain integer vector, so the `levels<-`
that gets called is the primitive, not the factor method:

> `levels<-`
function (x, value)  .Primitive("levels<-")

That's what generates the error. I don't fully understand what
.Primitive means here, so I will set that detail aside.

Suppose I revise the factor function to put the class(f) line before
the levels(f) assignment. Then `levels<-.factor` is called and all seems well.

factor <- function (x = character(), levels, labels = levels, exclude = NA,
                    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
                      nl, nL), domain = NA)
    ## class() moved up 3 rows
    class(f) <- c(if (ordered) "ordered", "factor")
    levels(f) <- if (nl == nL)
        as.character(labels)
    else paste0(labels, seq_along(levels))
    f
}

> assignInNamespace("factor", factor, "base")
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4
> attributes(y)
$class
[1] "factor"

$levels
[1] "1" "4"

That's a "good" answer for me.

But I broke your function. I eliminated the check for duplicated levels.

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
> y
[1] 1    4    <NA> <NA> <NA> <NA>
Levels: 1 4

Rather than have factor() return the "duplicated levels" error when
there are duplicated values in labels, I wonder whether it would be
better to check for duplicated levels directly. For example, insert a
new else in this stanza:

if (missing(levels)) {
    y <- unique(x, nmax = nmax)
    ind <- sort.list(y)
    levels <- unique(as.character(y)[ind])
} ## next is new part
else {
    levels <- unique(levels)
}

That will cause an error when there are duplicated levels because
there are more labels than levels:

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) :
invalid 'labels'; length 6 should be 1 or 2

So, in conclusion, if levels() can accept duplicated labels after a
factor has been created, I wish the equivalent labels argument would be
accepted by factor() too. What is your opinion?

pj
--
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

```
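The two-step workaround from the message (create the factor with unique levels, then relabel it) can be wrapped in a small helper without touching base R. This is only a sketch under stated assumptions: `factor_with_merged_labels` is a hypothetical name, not a base R function, and it relies on `levels<-.factor` merging duplicated labels and dropping `NA` labels, as the output in the message shows. Whether a direct `factor(..., labels = xlabels)` call fails may depend on the R version, so the sketch does not rely on that call.

```r
# Hypothetical helper (not part of base R): build the factor from unique
# levels first, then relabel via the factor method of `levels<-`, which
# merges duplicated labels and turns NA labels into NA values.
factor_with_merged_labels <- function(x, levels, labels) {
  # Duplicated *levels* stay an error; only duplicated *labels* are merged.
  stopifnot(length(labels) == length(levels), !anyDuplicated(levels))
  f <- factor(x, levels = levels)     # f already has class "factor" here
  levels(f) <- as.character(labels)   # dispatches to `levels<-.factor`
  f
}

x <- 1:6
xlabels <- c(1, NA, NA, 4, 4, 4)
y <- factor_with_merged_labels(x, levels = 1:6, labels = xlabels)
print(y)

# The generic is a primitive; the merging behaviour lives in the S3 method.
print(is.primitive(`levels<-`))
print(is.function(getS3method("levels<-", "factor")))
```

Keeping the duplicate check on `levels` separate from the relabelling step mirrors the distinction drawn in the message: several input values may share one label, but a single level should not be listed twice.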