[Rd] suggestion for extending ?as.factor

Martin Maechler maechler at stat.math.ethz.ch
Mon May 11 17:06:38 CEST 2009


>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
>>>>>     on Sun, 10 May 2009 13:52:53 +0200 writes:

    PS> On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote:
    PS> [...]
    >> If we'd revert to such a solution,
    >> we'd have to get back to Peter's point about the issue that
    >> he'd think  table(.) should be more tolerant than as.character()
    >> about "almost equality".
    >> For compatibility reasons, we could also go back to the
    >> reasoning that useR should use {something like}
    >> table(signif(x, 14)) 
    >> instead of
    >> table(x) 
    >> for numeric x in "typical" cases.
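A minimal sketch of that workaround, using the near-0.3 vector that
appears further down in this thread. How table(x) itself groups these
values depends on the R version, so only the rounded call is examined:

```r
# Four doubles that differ from each other by only a few ulps around 0.3
x <- 0.3 + 2e-16 * c(-2, -1, 1, 2)

# As doubles, all four values are distinct
n_distinct <- length(unique(x))

# Rounding to 14 significant digits before tabulating merges them
tab <- table(signif(x, 14))
n_cells <- length(tab)   # number of distinct cells after rounding
total   <- sum(tab)      # all observations are still counted
```

The choice of 14 digits is the one suggested above; any value safely
below the roughly 15 to 16 decimal digits a double carries would
behave similarly for data like this.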

    PS> In the released versions 2.8.1 and 2.9.0, function factor() satisfies
    PS> identical(as.character(factor(x)), as.character(x))    (*)
    PS> for all numeric x. This follows from the code (levels are computed by
    PS> as.character() from unmodified input values) and may be verified
    PS> even for the problematic cases, for example
    PS> x <- (0.3 + 2e-16 * c(-2,-1,1,2))
    PS> factor(x)
    PS> # [1] 0.300000000000000 0.3  0.3  0.300000000000000
    PS> # Levels: 0.300000000000000 0.3 0.3 0.300000000000000
    PS> as.character(x)
    PS> # [1] "0.300000000000000" "0.3"               "0.3"              
    PS> # [4] "0.300000000000000"
    PS> identical(as.character(factor(x)), as.character(x))
    PS> # [1] TRUE

    PS> In my opinion, it is reasonable to require that (*) be
    PS> preserved also in future versions of R.

    PS> Function as.character(x) has disadvantages. Besides
    PS> the platform dependence, it also does not always perform
    PS> the rounding needed to eliminate FP errors. Usually,
    PS> as.character(x) rounds to at most 15 significant digits,
    PS> so we get, for example

    PS> as.character(0.1 + 0.2) # [1] "0.3"
    PS> as required. However, there are also exceptions, for example
    PS> as.character(1e19 + 1e5) # [1] "10000000000000100352"

    PS> Here, the number is printed exactly, so the resulting
    PS> string contains the FP error caused by the fact that
    PS> 1e19 + 1e5 has more than 53 significant digits in binary
    PS> representation, namely 59.

    PS> binary representation of 1e19 + 1e5 is
    PS> 1000101011000111001000110000010010001001111010011000011010100000

    PS> binary representation of 10000000000000100352 is
    PS> 1000101011000111001000110000010010001001111010011000100000000000
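The rounding behind this example can be reproduced with plain double
arithmetic. The sketch below assumes IEEE 754 doubles (53-bit
significand); the variable names are only illustrative:

```r
x <- 1e19 + 1e5

# Spacing between adjacent doubles in [2^63, 2^64) is 2^(63 - 52) = 2048
ulp <- 2^(floor(log2(x)) - 52)

# 1e5 is not a multiple of 2048, so the sum is rounded to the nearest one
offset_stored  <- x - 1e19                # what was actually added to 1e19
offset_nearest <- round(1e5 / ulp) * ulp  # nearest multiple of ulp to 1e5
```

This is consistent with the count of 59 significant bits quoted above:
1e19 + 1e5 = 2^5 * (2^14 * 5^19 + 5^5), where the second factor is odd,
so the 64-bit integer has exactly 5 trailing zero bits.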

    PS> However, as.character(x) seems to do enough rounding for
    PS> most purposes, otherwise it would not be suitable as the
    PS> basic numeric to character conversion. If table() needs
    PS> factor() with a different conversion than
    PS> as.character(x), it may be done explicitly as discussed
    PS> by Martin above.

    PS> So, I suggest using as.character() as the default
    PS> conversion in factor(), so that
    PS> identical(as.character(factor(x)), as.character(x)) is
    PS> satisfied for the default usage of factor().

    PS> Of course, I would appreciate it if factor() had parameters
    PS> that allow better control of the underlying conversion,
    PS> as is done in the current development versions.

The version I have committed a few hours ago is indeed a much
re-simplified version, using  as.character(.) explicitly
and consequently no longer providing the extra optional
arguments that we have had for a couple of days.

Keeping such a basic function   factor()  as simple as possible 
seems a good strategy to me.
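
Property (*) from the quoted message can be checked directly. The
helper name check_star below is hypothetical, and the test vectors are
only illustrative choices of finite values:

```r
# Hypothetical helper: does factor() preserve the as.character() labels of x?
check_star <- function(x) {
  identical(as.character(factor(x)), as.character(x))
}

# Values that are distinct as doubles but may share a character label
x <- 0.3 + 2e-16 * c(-2, -1, 1, 2)
ok_near <- check_star(x)

# A few ordinary values, including a repeated one
ok_mixed <- check_star(c(0.1 + 0.2, 1/3, 2, 2, 1e19 + 1e5))
```

In versions where factor() derives its levels via as.character(), both
results should be TRUE.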

Martin Maechler


