[Rd] duplicated factor labels.

Martin Maechler maechler at stat.math.ethz.ch
Thu Jun 22 11:43:59 CEST 2017


>>>>> Paul Johnson <pauljohn32 at gmail.com>
>>>>>     on Fri, 16 Jun 2017 11:02:34 -0500 writes:

    > On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <jorismeys at gmail.com> wrote:
    >> To extwnd on Martin 's explanation :
    >> 
    >> In factor(), levels are the unique input values and labels the unique output
    >> values. So the function levels() actually displays the labels.
    >> 

    > Dear Joris

    > I think we agree. Currently, factor insists both levels and labels be unique.

    > I wish that it would accept nonunique labels. I also understand it
    > is impractical to change this now in base R.

    > I don't think I succeeded in explaining why this would be nicer.
    > Here's another example. Fairly often, we see input data like

    > x <- c("Male", "Man", "male", "Man", "Female")

    > The first four represent the same value.  I'd like to go in one step
    > to a new factor variable with enumerated types "Male" and "Female".
    > This fails

    > xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
    > labels = c("Male", "Male", "Male", "Female"))

    > Instead, we need 2 steps.

    > xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
    > levels(xf) <- c("Male", "Male", "Male", "Female")

    > I think it is quirky that `levels<-.factor` allows the duplicated
    > labels, whereas factor does not.

    > I wrote a function rockchalk::combineLevels to simplify combining
    > levels, but most of the students here like plyr::mapvalues to do it.
    > The use of levels() can be tricky because one must enumerate all
    > values, not just the ones being changed.

    > But I do understand Martin's point. It's been this way 25 years, it
    > won't change. :).

Well.. the above is a bit out of context.

Your first example really did not make the point to me (or to Joris),
and I showed that either of two simple factor() calls would
produce what you wanted:
	yc <- factor(c("1",NA,NA,"4","4","4"))
	yn <- factor(c( 1, NA,NA, 4,  4,  4))

Your new example is indeed much more convincing!

(Note though that the two steps that are needed can be written
 more concisely.)
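
For illustration only, and not necessarily the shorter form meant here:
one documented way to merge levels in a single assignment is the
named-list form of `levels<-` (see ?levels). A minimal sketch, reusing
the x from the quoted example; the variable name xf is just carried over
from that example:

```r
x <- c("Male", "Man", "male", "Man", "Female")

# Named-list form of `levels<-`: each list name is a new level and each
# element lists the old levels that should be merged into it.
xf <- factor(x)
levels(xf) <- list(Male = c("Male", "Man", "male"), Female = "Female")

levels(xf)   # the merged levels, in the order given by the list
table(xf)    # counts per merged level
```

With this form only the mapping has to be spelled out once, instead of
repeating a full vector of labels in the same order as the levels.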


(*) Indeed, as some of you have noted, we really should not "break behavior".
    To me this means we cannot accept a change there which gives an error
    or a different result in cases where the old behavior gave a valid factor.

I'm looking at a possible change currently
[not promising that a change will happen ...]


Martin


