[Rd] duplicated factor labels.

Martin Maechler maechler at stat.math.ethz.ch
Fri Jun 23 14:24:32 CEST 2017


>>>>> peter dalgaard <pdalgd at gmail.com>
>>>>>     on Fri, 23 Jun 2017 11:51:05 +0200 writes:

    > Hmm, the danger in this is that duplicated factor levels _used_ to be allowed (i.e. multiple codes with the same level). Disallowing it is what broke read.spss() on some files, because SPSS's concept of value labels is not 1-to-1 with factors. 
    > Reallowing it with different semantics could be premature. I mean, if we hadn't had the "forbidden" step, read.spss() could have changed behaviour unnoticed. So what if there is code relying on duplicate factor levels, which hasn't been run for some time?

Good point... but I think we should be relatively safe, unless
"some time" is ca. 8 years:

We have had a warning for these for ca. 7.5 years, namely from
R version 2.10.0 (2009-10-26)    up to 
R version 3.3.3 (2017-03-06) -- "Another Canoe" 

   > factor(1:2, labels = c("A","A"))
   [1] A A
   Levels: A A
   Warning message:
   In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
   duplicated levels in factors are deprecated

   > x <- c("Male", "Man", "male", "Man", "Female")
   > ## the new "direct" way:
   > xf <- factor(x, levels = c("Male", "Man",  "male", "Female"),
   +              labels = c("Male", "Male", "Male", "Female"))
   Warning message:
   In `levels<-`(`*tmp*`, value = c("Male", "Male", "Male", "Female" :
     duplicated levels will not be allowed in factors anymore
   > xf
   [1] Male   Male   Male   Male   Female
   Levels: Male Male Male Female
   > 

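which gave a result somewhat similar to the new R-devel
result, where the duplicated labels are merged into a single level
(as in the  factor(1:2, labels = c("A","A"))  example quoted below).
I would argue the new result should be fine....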
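
To make the comparison concrete, here is a minimal sketch, assuming
R >= 3.5.0 for the one-step call (the names lev, lab, xf1 and xf2 are
only illustrative).  It checks that the direct call and the two-step
route via `levels<-` end up with the same values and levels:

```r
x   <- c("Male", "Man", "male", "Man", "Female")
lev <- c("Male", "Man", "male", "Female")
lab <- c("Male", "Male", "Male", "Female")

## Two-step route: build the factor, then merge levels via `levels<-`
xf2 <- factor(x, levels = lev)
levels(xf2) <- lab

## One-step route; assumes R >= 3.5.0, where duplicated labels are merged
xf1 <- factor(x, levels = lev, labels = lab)

## Both routes should give the same levels and the same values
stopifnot(identical(levels(xf1), levels(xf2)),
          identical(as.character(xf1), as.character(xf2)))
```

In R versions where the one-step call is not (yet) allowed, only the
two-step part of the sketch is expected to run.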

Yes, if unwise people used  suppressWarnings(.)  around their
code, they may be surprised now... but that's what you get if
you suppress warnings without enough thought, no?
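
For those who do want quiet output without losing the information,
one possible alternative is to record the warnings instead of
discarding them.  A minimal sketch, where  collect_warnings()  is a
hypothetical helper (not part of base R) built on withCallingHandlers():

```r
## Hypothetical helper: evaluate `expr`, keep its value, and collect any
## warning messages in a character vector instead of throwing them away.
collect_warnings <- function(expr) {
    msgs <- character()
    value <- withCallingHandlers(
        expr,
        warning = function(w) {
            msgs <<- c(msgs, conditionMessage(w))  # remember the message
            invokeRestart("muffleWarning")         # then continue quietly
        })
    list(value = value, warnings = msgs)
}

## Example with a coercion that is known to warn
res <- collect_warnings(as.integer(c("1", "x")))
res$warnings   # kept for inspection rather than silently dropped
```

The collected messages can then be checked or logged, so that a
deprecation warning does not go unnoticed for years.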



    > -pd

    >> On 23 Jun 2017, at 10:42 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
    >> 
    >>>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
    >>>>>>> on Thu, 22 Jun 2017 11:43:59 +0200 writes:
    >> 
    >>>>>>> Paul Johnson <pauljohn32 at gmail.com>
    >>>>>>> on Fri, 16 Jun 2017 11:02:34 -0500 writes:
    >> 
    >>>> On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <jorismeys at gmail.com> wrote:
    >>>>> To extend on Martin's explanation:
    >>>>> 
    >>>>> In factor(), levels are the unique input values and labels the unique output
    >>>>> values. So the function levels() actually displays the labels.
    >>>>> 
    >> 
    >>>> Dear Joris
    >> 
    >>>> I think we agree. Currently, factor insists both levels and labels be unique.
    >> 
    >>>> I wish that it would accept nonunique labels. I also understand it
    >>>> is impractical to change this now in base R.
    >> 
    >>>> I don't think I succeeded in explaining why this would be nicer.
    >>>> Here's another example. Fairly often, we see input data like
    >> 
    >>>> x <- c("Male", "Man", "male", "Man", "Female")
    >> 
    >>>> The first four represent the same value.  I'd like to go in one step
    >>>> to a new factor variable with enumerated types "Male" and "Female".
    >>>> This fails
    >> 
    >>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
    >>>> labels = c("Male", "Male", "Male", "Female"))
    >> 
    >>>> Instead, we need 2 steps.
    >> 
    >>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
    >>>> levels(xf) <- c("Male", "Male", "Male", "Female")
    >> 
    >>>> I think it is quirky that `levels<-.factor` allows the duplicated
    >>>> labels, whereas factor does not.
    >> 
    >>>> I wrote a function rockchalk::combineLevels to simplify combining
    >>>> levels, but most of the students here like plyr::mapvalues to do it.
    >>>> The use of levels() can be tricky because one must enumerate all
    >>>> values, not just the ones being changed.
    >> 
    >>>> But I do understand Martin's point. It's been this way 25 years; it
    >>>> won't change. :)
    >> 
    >>> Well.. the above is a bit out of context.
    >> 
    >>> Your first example really did not make a point to me (and Joris)
    >>> and I showed that you could use even two different simple factor() calls to
    >>> produce what you wanted 
    >>> yc <- factor(c("1",NA,NA,"4","4","4"))
    >>> yn <- factor(c( 1, NA,NA, 4,  4,  4))
    >> 
    >>> Your new example is indeed  much more convincing !
    >> 
    >>> (Note though that the two steps that are needed can be written 
    >>> more shortly
    >> 
    >>> The  "been this way 25 years"  is a reason to be very
    >>> cautious(*) with changes, but not a reason for no changes!
    >> 
    >>> (*) Indeed as some of you have noted we really should not "break behavior".
    >>> This means to me we cannot accept a change there which gives
    >>> an error or a different result in cases the old behavior gave a valid factor.
    >> 
    >>> I'm looking at a possible change currently
    >>> [not promising that a change will happen ...]
    >> 
    >> In the end, I've liked the change (after 2-3 iterations), and
    >> have now been brave enough to commit it to R-devel (svn 72845).
    >> 
    >> With the change, I had to disable one of our own regression
    >> checks (tests/reg-tests-1b.R, line 726):
    >> 
    >> The following is now (in R-devel -> R 3.5.0) valid:
    >> 
    >>> factor(1:2, labels = c("A","A"))
    >> [1] A A
    >> Levels: A
    >>> 
    >> 
    >> I wonder how many CRAN package checks will "break" from
    >> this (my guess is in the order of a dozen), but I hope
    >> that these breakages will be benign, e.g., similar to the above
    >> case where before an error was expected via tools :: assertError(.)
    >> 
    >> Martin
    >> 
    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel

    > -- 
    > Peter Dalgaard, Professor,
    > Center for Statistics, Copenhagen Business School
    > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
    > Phone: (+45)38153501
    > Office: A 4.23
    > Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


