[Rd] suggestion for extending ?as.factor

Petr Savicky savicky at cs.cas.cz
Sun May 10 13:52:53 CEST 2009


On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote:
[...]
> If'd revert to such a solution,
> we'd have to get back to Peter's point about the issue that
> he'd think  table(.) should be more tolerant than as.character()
> about "almost equality".
> For compatibility reasons, we could also return back to the
> reasoning that useR should use {something like}
>     table(signif(x, 14)) 
> instead of
>     table(x) 
> for numeric x in "typical" cases.

In the released versions 2.8.1 and 2.9.0, function factor() satisfies
  identical(as.character(factor(x)), as.character(x))    (*)
for all numeric x. This follows from the code (levels are computed by
as.character() from unmodified input values) and may be verified
even for the problematic cases, for example
  x <- (0.3 + 2e-16 * c(-2,-1,1,2))
  factor(x)
  # [1] 0.300000000000000 0.3               0.3               0.300000000000000
  # Levels: 0.300000000000000 0.3 0.3 0.300000000000000
  as.character(x)
  # [1] "0.300000000000000" "0.3"               "0.3"              
  # [4] "0.300000000000000"
  identical(as.character(factor(x)), as.character(x))
  # [1] TRUE

In my opinion, it is reasonable to require that (*) be preserved also in future
versions of R.
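
As a sketch, (*) can be wrapped in a small check and tried on several inputs.
The helper name keeps_labels is hypothetical, not part of R, and the check
assumes that factor() derives its levels from as.character() of the
unmodified values, as described above.

```r
# Hypothetical helper: TRUE if factor() preserves as.character() on x
keeps_labels <- function(x) {
  identical(as.character(factor(x)), as.character(x))
}

x <- 0.3 + 2e-16 * c(-2, -1, 1, 2)
y <- c(1e19 + 1e5, 0.1 + 0.2, 1/3)

print(keeps_labels(x))
print(keeps_labels(y))
```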

Function as.character(x) has disadvantages. Besides its platform dependence,
it does not always perform the rounding needed to eliminate FP errors. Usually,
as.character(x) rounds to at most 15 significant digits, so we get, for example
  as.character(0.1 + 0.2) # [1] "0.3"
as required. However, there are also exceptions, for example
  as.character(1e19 + 1e5) # [1] "10000000000000100352"

Here, the number is printed exactly, so the resulting string contains the FP error
caused by the fact that the exact sum 1e19 + 1e5 needs 59 significant bits in binary
representation, more than the 53 bits available in a double.

  binary representation of the exact sum 1e19 + 1e5 is
  1000101011000111001000110000010010001001111010011000011010100000

  binary representation of the stored value 10000000000000100352 is
  1000101011000111001000110000010010001001111010011000100000000000
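
The size of this error can be checked with ordinary double arithmetic. The
following is only a sketch, assuming IEEE 754 doubles with a 53-bit
significand; the variable names are mine.

```r
x <- 1e19 + 1e5

# Spacing between adjacent doubles near x: 2^(exponent - 52)
ulp <- 2^(floor(log2(x)) - 52)
print(ulp)                      # 2048, i.e. 2^11

# 1e19 = 2^19 * 5^19 is a multiple of 2048 and is stored exactly,
# so only the 1e5 part is rounded, to the nearest multiple of ulp
print(round(1e5 / ulp) * ulp)   # 100352

# The amount that was actually added to 1e19
print(x - 1e19)                 # 100352
```

The last eleven bits of the stored value are zero, which matches the second
binary string above.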

However, as.character(x) seems to do enough rounding for most purposes; otherwise
it would not be suitable as the basic numeric-to-character conversion. If table() needs
factor() with a different conversion than as.character(x), that conversion can be
applied explicitly, as discussed by Martin above.
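
For the problematic vector used earlier, the explicit rounding quoted from
Martin's message could look like this. It is a sketch of the user-side
workaround only, not of any change to table() itself.

```r
x <- 0.3 + 2e-16 * c(-2, -1, 1, 2)

# The four values are distinct doubles ...
print(length(unique(x)))              # 4

# ... but they coincide after rounding to 14 significant digits
print(length(unique(signif(x, 14))))  # 1

# So the rounded input is tabulated as a single cell with count 4
tab <- table(signif(x, 14))
print(tab)
```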

So I suggest using as.character() as the default conversion in factor(), so that
  identical(as.character(factor(x)), as.character(x))
is satisfied for the default usage of factor().

Of course, I would appreciate it if factor() had parameters that allow better control
of the underlying conversion, as is done in the current development versions.
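
To illustrate what such control could mean, here is a minimal sketch of a
wrapper that takes the conversion as an argument. The name factor_conv and
its conv parameter are hypothetical; this does not reproduce the interface
of the development versions.

```r
# Hypothetical wrapper: build a factor using a user-supplied
# numeric-to-character conversion, with as.character() as the default
factor_conv <- function(x, conv = as.character) {
  labs <- conv(x)                        # label of every element
  levs <- unique(conv(sort(unique(x))))  # levels in numeric order
  factor(labs, levels = levs)
}

x <- 0.3 + 2e-16 * c(-2, -1, 1, 2)

# Default conversion: (*) holds by construction
f_default <- factor_conv(x)
print(identical(as.character(f_default), as.character(x)))

# Explicit 15-significant-digit conversion: the four values share one level
f_round <- factor_conv(x, conv = function(v) sprintf("%.15g", v))
print(levels(f_round))
```

With the default, the labels are exactly as.character(x); any coarser
conversion has to be requested explicitly by the caller.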

Petr.


