[Rd] suggestion for extending ?as.factor

Tue May 5 10:35:42 CEST 2009

>>>>> "PD" == Peter Dalgaard <P.Dalgaard at biostat.ku.dk>
>>>>>     on Mon, 04 May 2009 19:28:06 +0200 writes:

    PD> Petr Savicky wrote:
    >> On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
    >> [snip]
    >>> Let me quickly expand the tasks we have wanted to address, when
    >>> I started changing factor() for R-devel.
    >>> 
    >>> 1) R-core had unanimously decided that R 2.10.0 should not allow
    >>> duplicated levels in factors anymore.
    >>> 
    >>> When working on that, I had realized that quite a few bits of code
    >>> were implicitly relying on duplicated levels (or something
    >>> related), see below, so the current version of R-devel only
    >>> *warns* in some cases where duplicated levels are produced
    >>> instead of giving an error.
    >>> 
    >>> What I had also found was that basically, even our own (!) code
    >>> and quite a bit of user code has more or less relied on other
    >>> things that were not true (even though "almost always" fulfilled):
    >>> 
    >>> 2) if x contains no duplicated values, then  factor(x) should neither
    >>> 
    >>> 3) factor(x) constructs a factor object with *unique* levels
    >>> 
    >>> {This is what our decision "1)" implies and now enforces}
    >>> 
    >>> 4) as.numeric(names(table(x))) should be  identical to unique(x)
    >>> 
    >>> where "4)" is basically ensured by "3)" as table() calls
    >>> factor() for non-factor args.
    >>> 
    >>> As mentioned the bad thing is that "2) - 4)" are typically
    >>> fulfilled in all tests package writers would use.
    >>> 
    >>> Concerning '3)' [and '1)'], as you know, inside R-core we have
    >>> proposed to at least ensure that  `levels<-` 
    >>> should not allow duplicated levels, 
    >>> and I had concluded that
    >>> a) factor() really should use  `levels<-` instead of the low-level	
    >>> attr(., "levels") <- ....
    >>> b) factor() itself must make sure that the default levels became unique.
    >>> 
    >>> ---
    >>> 
    >>> Given Petr's (and more) examples and the strong requirement of
    >>> "user convenience" and back-compatibility,
    >>> I now tend to agree (with Peter) that we cannot ensure all of 2)
    >>> and 4) still allow factor() to behave as it did for "rounded
    >>> decimal numbers",
    >>> and consequently would have to (continue to) not ensuring
    >>> properties (2) and (4).
    >>> Something quite unfortunate, since, as I said, much useR code
    >>> implicitly relies on these, and so that code is buggy even
    >>> though the bug will only show in exceptional cases.
    >> 
    >> Let me suggest to consider also the following algorithm: determine
    >> the number of digits needed to preserve the double value exactly for
    >> each number separately. An R code prototype demonstrating the 
    >> algorithm could be as follows
    >> 
    >> convert <- function(x) # x should be a single number
    >> {
    >> for (d in 1:16) {
    >> y <- sprintf(paste("%.", d, "g", sep=""), x)
    >> if (x == as.numeric(y)) {
    >> return(y)
    >> }
    >> }
    >> return(sprintf("%.17g", x))
    >> }
    >> 
    >> For this, we get
    >> 
    >> > convert(0.3)
    >> [1] "0.3"
    >> > convert(1/3)
    >> [1] "0.3333333333333333" # 16 digits suffice
    >> > convert(0.12345)
    >> [1] "0.12345"
    >> > convert(0.12345678901234567)
    >> [1] "0.12345678901234566"
    >> > 0.12345678901234567 == as.numeric("0.12345678901234566")
    >> [1] TRUE
    >> 
    >> This algorithm is slower than a single call to sprintf("%.17g", x), but it
    >> produces nicer numbers, if possible, and guarantees that the value is
    >> always preserved.

    PD> Yes, but....

    >> convert(0.2+0.1)
    PD> [1] "0.30000000000000004"

    PD> I think that the real issue is that we actually do want almost-equal
    PD> numbers to be folded together. 

in most cases, but not all {*when*  levels is not specified},
but useR's code sometimes *does* rely on  factor()  /  table()
using exact values.

Also, what should happen when the user explicitly calls

      factor(x, levels = sort(unique(x)))

at least in that case we really should *not* fold almost equals.
and the "old" code (<= R 2.9.0) did fold them in border cases,
and lead non-unique levels.

Can we agree that any rounding etc - if needed - will only
happen when
  1) missing(levels)
  2) is.numeric(x) || is.complex(x)

I'm also thinking of at least keeping the current behavior as an
option, e.g. by  factor(x, ...., keepUniqueness = TRUE, ....)
where the default would be keepUniqueness = FALSE.

    PD> The most relevant case I can conjure up is this (permutation testing):

    >> zz <- replicate(20000,sum(sample(sleep$extra,10)))
    >> length(table(zz))
    PD> [1] 427
    >> length(table(signif(zz,7)))
    PD> [1] 281

    PD> Notice that the discrepancy comes from sums that really are identical
    PD> values (in decimal arithmetic), but where the binary FP inaccuracy makes
    PD> them slightly different.

Yes, that's a good example.

However, I now think it would be helpful to slightly separate
the issue from what factor() should do from 
how table() should call factor() in those cases it does.

Also, you originally said

 >> We might as well have abolished conversion of numerics to
 >> factor altogether.

and I now tend to think that there's something in that:
The conversion of non-integer numerics to factors could indeed
be something the user should be urged to with care,
but yes, I agree that a direct   factor(format(x), ...)  is too
cheap.

    PD> [for a nice picture, continue the example with

    >> tt <- table(signif(zz,7))
    >> plot(as.numeric(names(tt)),tt, type="h")

    PD> ]

    PD> -- 
    PD> O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
    PD> c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
    PD> (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
    PD> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907