[Rd] Match .3 in a sequence

Petr Savicky savicky at cs.cas.cz
Tue Mar 17 16:21:44 CET 2009


On Tue, Mar 17, 2009 at 10:04:39AM -0400, Stavros Macrakis wrote:
...
> 1) Factor allows repeated levels, e.g. factor(c(1),c(1,1,1)), with no
> warning or error.

Yes, this is a confusing behavior, since repeated levels are never meaningful.

> 2) Even from distinct inputs, factor of a numeric vector may generate
> repeated levels, because it only uses 15 digits.

I think 15 digits is a reasonable choice. A mapping between double precision
numbers and character strings with a given decimal precision can never be
bijective. With 15 digits, we can achieve that every character value has a
unique double precision representation, but not vice versa. With 17 digits,
we have a unique character string for each double precision number, but not
vice versa. Which is better?
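
For illustration, two distinct double precision numbers can map to the same
15-digit string, while 17 digits keep them apart (the particular numbers are
arbitrary):

  x <- 0.3
  y <- 0.3 + 2e-16            # a different double, a few ulp away from 0.3
  identical(x, y)
  # [1] FALSE
  sprintf("%.15g", c(x, y))   # 15 digits: both collapse to the same string
  # [1] "0.3" "0.3"
  sprintf("%.17g", c(x, y))   # 17 digits: each double gets its own string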

The specification of as.character() says that the numbers are represented
with 15 significant digits. So I think that if as.factor() applied
signif(x, digits=15) to a numeric vector before determining the levels with
sort(unique.default(x)), this could eliminate most of the problems without
conflicting with the existing specification.
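
Roughly along these lines (factor15 is only a hypothetical name used for
illustration, not an existing function):

  # round to 15 significant digits first, so that values which
  # as.character() cannot distinguish end up in the same level
  factor15 <- function(x) {
    x <- signif(x, digits = 15)
    factor(x, levels = sort(unique.default(x)))
  }
  factor15(c(0.3, 0.3 + 2e-16))
  # [1] 0.3 0.3
  # Levels: 0.3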

> 3) The algorithm to determine the shortest format is inconsistent with
> the algorithm to actually print, giving pathological cases like 0.3
> vs. 0.300000000000000.

I do not exactly understand what you mean by inconsistent. If you do
  nums <- (.3 + 2e-16 * c(-2,-1,1,2))
  options(digits=15)
  for (x in nums) print(x)
  # [1] 0.300000000000000
  # [1] 0.3
  # [1] 0.3
  # [1] 0.300000000000000
  as.character(nums)
  # [1] "0.300000000000000" "0.3"               "0.3"              
  # [4] "0.300000000000000"
then print and as.character are consistent. Printing the whole vector
behaves differently, since it uses the same format for all numbers.
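
For comparison, printing the whole vector at once:

  print(nums)
  # one common format is chosen for all four elements, so they are
  # all displayed with the same number of digits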

> The original problem was testing whether a floating-point number was a
> member of a vector.  rounding and then converting to a factor seem
> like a very poor way of doing that, even if the above problems were
> resolved.  Comparing with a tolerance seems much more robust, clean,
> and efficient.

Definitely, using a comparison tolerance is a meaningful approach. Its
disadvantage is that the relation abs(x - y) <= eps is not transitive, so it
may also produce confusing results in some situations. I think one has to
choose the right solution depending on the application.
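
For example (eps and the values below are arbitrary, chosen only to show the
effect):

  eps <- 1e-8
  x <- 0; y <- 0.6e-8; z <- 1.2e-8
  abs(x - y) <= eps           # TRUE:  x is "equal" to y within eps
  abs(y - z) <= eps           # TRUE:  y is "equal" to z within eps
  abs(x - z) <= eps           # FALSE: but x is not "equal" to z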

Petr.


