[Rd] suggestion for extending ?as.factor

Martin Maechler maechler at stat.math.ethz.ch
Mon May 4 17:39:52 CEST 2009


>>>>> "PD" == Peter Dalgaard <P.Dalgaard at biostat.ku.dk>
>>>>>     on Mon, 04 May 2009 15:34:09 +0200 writes:

    PD> Martin Maechler wrote:
    >>>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
    >>>>>>> on Sun, 3 May 2009 22:32:04 +0200 writes:
    >> 
    PS> In R-2.10.0, the development version, function as.factor() uses 17 digit
    PS> precision for conversion of numeric values to character type. This
    PS> is very good for the consistency of the resulting factor, however,
    PS> i expect that people will complain about, for example, as.factor(0.3)
    PS> being
    PS> [1] 0.29999999999999999
    PS> Levels: 0.29999999999999999
    >> 
    PS> I suggest to extend the "Warning" section of ?as.factor by the following
    PS> paragraph.
    >> 
    PS> If as.factor() is used for a numeric vector, then the numbers are
    PS> converted to character strings with 17 digit precision using their
    PS> machine representation. This guarantees that different numbers are
    PS> converted to different levels, but may produce unwanted results, if
    PS> the numbers are expected to have limited number of decimal positions.
    PS> For example, as.factor(c(0.1, 0.2, 0.3)) produces
    PS> [1] 0.10000000000000001 0.20000000000000001 0.29999999999999999
    PS> Levels: 0.10000000000000001 0.20000000000000001 0.29999999999999999
    PS> In order to avoid this, convert the numbers to a character vector
    PS> using formatC() or a similar function before using as.factor().
    >> 
    PS> Petr.
    >> 
    >> Thank you, Petr, for the good suggestion.
    >> 
    >> I have added a (shorter) paragraph, though to the 'Details' not the
    >> 'Warning' section, and also one to the 'Examples' :
    >> 
    >> ## Converting (non-integer) numbers:
    >> as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use
    >> factor(format(c(0.1, 0.2, 0.3)))

    PD> Martin,

    PD> I tend to consider this a bug, plain and simple. We might as well have
    PD> abolished conversion of numerics to factor altogether.

hmm, that comparison *is* an exaggeration ((Much code would have
stopped working had we implemented the latter !!))

    PD> (Notice, BTW, that conversions to mode "character"
    PD> changes the sort order so format() is not a universal
    PD> fix. IIRC, we did consider the 1 10 2 3 4 5 6 7 8 9
    PD> issue when designing R's version of factor().)

yes, but I did not understand why this is relevant here;
as
  > factor(c(10,1:9, 8:2,1:5))
   [1] 10 1  2  3  4  5  6  7  8  9  8  7  6  5  4  3  2  1  2  3  4  5 
  Levels: 1 2 3 4 5 6 7 8 9 10

also in earlier versions of R.
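
A small sketch of the ordering difference, assuming Peter means the
character sort order (the variable names are illustrative only): levels
built from numbers are sorted numerically, whereas levels built from
strings are sorted as strings.

```r
x <- c(10, 1:9)

# Numeric input: levels are sorted numerically before conversion to character.
num_levels <- levels(factor(x))

# Character input: levels are sorted as strings, so "10" follows "1".
chr_levels <- levels(factor(as.character(x)))

print(num_levels)
print(chr_levels)
```

So the "1 10 2 3 ..." ordering only arises once the numbers have been
turned into character strings before factor() sees them.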

    PD> The current R-devel behaviour is silly and we should just get rid of it
    PD> before a final release. It should be the other way around: If people
    PD> rely on whether numerical factor levels differ with 17 digits precision,
    PD> THEN they should use format with suitable arguments.

Hmm, I now tend to agree that we must further change some of
the R-devel behavior.

    PD> If we have issues with numeric values that are very slightly different
    PD> but round to get the same level name, how about putting something like

    PD> if (is.numeric(x)) x <- zapsmall(x)

well, rather  

      if (is.numeric(x)) x <- signif(x, 15)

where '15' could be replaced by 7 or other values in 5:20
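
A sketch of how the two proposals differ (the example vectors are
illustrative, not taken from the thread): zapsmall() rounds relative to
the largest element of the vector, so small entries can be rounded away
entirely, whereas signif() rounds each element to its own 15 significant
digits.

```r
# Two values that both print as 0.3 but differ in their last bits.
x <- c(0.3, 0.1 + 0.2)
n_raw     <- length(unique(x))              # distinct doubles before rounding
n_rounded <- length(unique(signif(x, 15)))  # distinct doubles after rounding

# A vector mixing a large and a small value.
y <- c(1e10, 0.123456)
zapped  <- zapsmall(y)     # rounding is relative to the largest element
rounded <- signif(y, 15)   # each element keeps its own significant digits

print(c(n_raw, n_rounded))
print(zapped)
print(rounded)
```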

    PD> somewhere at the start of the factor() function?

Let me quickly expand on the tasks we wanted to address when
I started changing factor() for R-devel.

1) R-core had unanimously decided that R 2.10.0 should not allow
   duplicated levels in factors anymore.

When working on that, I had realized that quite a few bits of code
were implicitly relying on duplicated levels (or something
related), see below, so the current version of R-devel only
*warns* in some cases where duplicated levels are produced
instead of giving an error.

What I had also found was that basically, even our own (!) code
and quite a bit of user code has more or less relied on other
things that were not true (even though "almost always" fulfilled):

2) if x contains no duplicated values, then  factor(x) should neither

3) factor(x) constructs a factor object with *unique* levels

  {This is what our decision "1)" implies and now enforces}

4) as.numeric(names(table(x))) should be  identical to unique(x)

  where "4)" is basically ensured by "3)" as table() calls
  factor() for non-factor args.

As mentioned, the bad thing is that "2) - 4)" are typically
fulfilled in all tests package writers would use.
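
A minimal sketch of how "4)" can fail, assuming a released version of R
in which level names carry at most 15 significant digits (the vector x
is made up for illustration):

```r
x <- c(0.3, 0.1 + 0.2, 0.5)

distinct   <- sort(unique(x))              # the distinct doubles in x
from_table <- as.numeric(names(table(x)))  # values recovered from level names

print(length(distinct))
print(length(from_table))
print(identical(from_table, distinct))
```

With "nice" test data such as 1:10 or c(0.1, 0.2, 0.3) the two results
agree, which is why the assumption tends to survive package tests.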

Concerning '3)' [and '1)'], as you know, inside R-core we have
proposed to at least ensure that  `levels<-` 
should not allow duplicated levels, 
and I had concluded that
a) factor() really should use  `levels<-` instead of the low-level
   attr(., "levels") <- ....
b) factor() itself must make sure that the default levels become unique.
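
To illustrate the distinction in a), here is a sketch of how the two
assignments behave in current released R, as far as I can tell (the
thread itself is about what R-devel should do, so take this as an
illustration only; f, g and h are made-up names):

```r
f <- factor(c("a", "b", "c"))

# High-level replacement function: equal new names merge the levels.
g <- f
levels(g) <- c("x", "x", "y")

# Low-level attribute assignment: no check, the duplicate is kept.
h <- f
attr(h, "levels") <- c("x", "x", "y")

print(levels(g))
print(attr(h, "levels"))
```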

---

Given Petr's (and more) examples and the strong requirement of
"user convenience" and back-compatibility,
I now tend to agree (with Peter) that we cannot ensure all of 2)
and 4) and still allow factor() to behave as it did for "rounded
decimal numbers",
and consequently we would have to (continue to) not ensure
properties (2) and (4).
This is quite unfortunate, since, as I said, much useR code
implicitly relies on these, and so that code is buggy even
though the bug will only show in exceptional cases.


Best,
Martin


