[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

Petr Savicky savicky at cs.cas.cz
Sat May 23 09:44:54 CEST 2009


The function factor() in the current development version (2009-05-22)
guarantees that the levels are distinct character strings. However, two
levels may still represent the same decimal number. The following example
is derived from a posting by Stavros Macrakis in the thread "Match .3 in
a sequence" in March.

  nums <- 0.3 + 2e-16 * c(-2,-1,1,2)
  f <- factor(nums)
  levels(f)
  # [1] "0.300000000000000" "0.3"              

The levels differ only in trailing zeros and represent the same decimal
number. Besides not being really meaningful, this may cause a problem when
using as.numeric(levels(f)).

In the above case, as.numeric() works fine and maps the two levels to the
same number. However, there are cases where the difference in trailing zeros
leads to different values in as.numeric(levels(f)), and these values may even
form a decreasing sequence although the levels were constructed from an
increasing sequence of numbers.

Examples are platform dependent, but may be found with the following code.
It was tested on Intel under Linux (both with and without SSE) and also under
Windows with an older version of R.

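  for (i in 1:100000) {
      # random magnitude with decimal exponent 61 or 62
      x <- 10^(floor(runif(1, 61, 63)) + runif(1)/2)
      # round it to 14 significant digits
      x <- as.numeric(sprintf("%.14g", x))
      # spacing between adjacent doubles near x
      eps <- 2^(floor(log2(x)) - 52)
      # offsets, in units of eps, for relative distances 5e-16 to 1e-15
      k <- round(x * c(5e-16, 1e-15) / eps)
      if (x > 1e62) { k <- rev( - k) }
      # increasing run of consecutive doubles near x
      y <- x + k[1]:k[2] * eps
      # positions where the round trip through character decreases
      ind <- which(diff(as.numeric(as.character(y))) < 0)
      for (j in ind) {
          u1 <- y[c(j, j+1)]
          u2 <- factor(u1)
          print(levels(u2))
          print(diff(as.numeric(levels(u2))))
          aux <- readline("next")
      }
  }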

An example of the output is

  [1] "1.2296427920313e+61"  "1.22964279203130e+61"
  [1] -1.427248e+45
  next
  [1] "1.82328862326830e+62" "1.8232886232683e+62" 
  [1] -2.283596e+46
  next

A negative number in diff(as.numeric(levels(u2))) indicates a case where
as.numeric(levels(u2)) is decreasing. We can also see that the reason
is that the two strings in levels(u2) differ in their trailing zeros.

I did a quite intensive search for such examples over all possible exponents
(not only 61 and 62, using a week of CPU time on three processors), and all
the examples found were caused by a difference in trailing zeros. So, I
believe that removing trailing zeros from the output of as.character(x)
solves the problem with the reversed order in as.numeric(levels(factor(x)))
entirely.

A patch against R-devel_2009-05-22, which eliminates trailing zeros
from as.character(x) but makes no other changes to as.character(x),
is attached. With the patch, we also obtain a better result in the
following example.

  nums <- 0.3 + 2e-16 * c(-2,-1,1,2)
  factor(nums)
  # [1] 0.3 0.3 0.3 0.3
  # Levels: 0.3

Petr.

-------------- next part --------------
--- R-devel/src/main/coerce.c	2009-04-17 17:53:35.000000000 +0200
+++ R-devel-elim-trailing/src/main/coerce.c	2009-05-23 08:39:03.914774176 +0200
@@ -294,12 +294,33 @@
     else return mkChar(EncodeInteger(x, w));
 }
 
+const char *elim_trailing(const char *s, char cdec)
+{
+    const char *p;
+    char *replace;
+    for (p = s; *p; p++) {
+        if (*p == cdec) {
+            replace = (char *) p++;
+            while ('0' <= *p && *p <= '9') {
+                if (*(p++) != '0') {
+                    replace = (char *) p;
+                }
+            }
+            while (*(replace++) = *(p++)) {
+                ;
+            }
+            break;
+        }
+    }
+    return s;
+}
+
 SEXP attribute_hidden StringFromReal(double x, int *warn)
 {
     int w, d, e;
     formatReal(&x, 1, &w, &d, &e, 0);
     if (ISNA(x)) return NA_STRING;
-    else return mkChar(EncodeReal(x, w, d, e, OutDec));
+    else return mkChar(elim_trailing(EncodeReal(x, w, d, e, OutDec), OutDec));
 }
 
 SEXP attribute_hidden StringFromComplex(Rcomplex x, int *warn)

