[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

Martin Maechler maechler at stat.math.ethz.ch
Wed May 27 22:51:38 CEST 2009


>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
>>>>>     on Sat, 23 May 2009 09:44:54 +0200 writes:

    PS> Function factor() in the current development version
    PS> (2009-05-22) guarantees that levels are different
    PS> character strings. However, they may represent the same
    PS> decimal number. The following example is derived from a
    PS> posting by Stavros Macrakis in thread "Match .3 in a
    PS> sequence" in March

    PS>   nums <- 0.3 + 2e-16 * c(-2,-1,1,2) f <- factor(nums)
    PS> levels(f) # [1] "0.300000000000000" "0.3"

    PS> The levels differ in trailing zeros, but represent the
    PS> same decimal number.  Besides that this is not really
    PS> meaningful, it may cause a problem, when using
    PS> as.numeric(levels(f)).

    PS> In the above case, as.numeric() works fine and maps the
    PS> two levels to the same number. However, there are cases,
    PS> where the difference in trailing zeros implies different
    PS> values in as.numeric(levels(f)) and these values may
    PS> even form a decreasing sequence although levels were
    PS> constructed from an increasing sequence of numbers.

    PS> Examples are platform dependent, but may be found by the
    PS> following code.  Tested on Intel under Linux (both with
    PS> and without SSE) and also under Windows with an older
    PS> version of R.

    PS> for (i in 1:100000) {
    PS> x <- 10^(floor(runif(1, 61, 63)) + runif(1)/2)
    PS> x <- as.numeric(sprintf("%.14g", x))
    PS> eps <- 2^(floor(log2(x)) - 52)
    PS> k <- round(x * c(5e-16, 1e-15) / eps)
    PS> if (x > 1e62) { k <- rev( - k) }
    PS> y <- x + k[1]:k[2] * eps
    PS> ind <- which(diff(as.numeric(as.character(y))) < 0)
    PS> for (j in ind) {
    PS> u1 <- y[c(j, j+1)]
    PS> u2 <- factor(u1)
    PS> print(levels(u2))
    PS> print(diff(as.numeric(levels(u2))))
    PS> aux <- readline("next")
    PS> }
    PS> }

    PS> An example of the output is

    PS> [1] "1.2296427920313e+61"  "1.22964279203130e+61"
    PS> [1] -1.427248e+45
    PS> next
    PS> [1] "1.82328862326830e+62" "1.8232886232683e+62" 
    PS> [1] -2.283596e+46
    PS> next

    PS> The negative number in diff(as.numeric(levels(u2))) demonstrates cases,
    PS> when as.numeric(levels(u2)) is decreasing. We can also see that the reason
    PS> is that the two strings in levels(u2) differ in the trailing zeros.

    PS> I did quite intensive search for such examples for all possible exponents
    PS> (not only 61 and 62 and a week of CPU on three processors) and all the obtained
    PS> examples were caused by a difference in trailing zeros. So, i believe that
    PS> removing trailing zeros from the output of as.character(x) solves the problem
    PS> with the reversed order in as.numeric(levels(factor(x))) entirely.

    PS> A patch against R-devel_2009-05-22, which eliminates trailing zeros
    PS> from as.character(x), but makes no other changes to as.character(x),
    PS> is in an attachment. Using the patch, we obtain a better result also
    PS> in the following.

    PS> nums <- 0.3 + 2e-16 * c(-2,-1,1,2)
    PS> factor(nums)
    PS> # [1] 0.3 0.3 0.3 0.3
    PS> # Levels: 0.3

yes, of course.

    PS> Petr.

[.....]

I have very slightly  modified the changes (to get rid of -Wall
warnings) and also exported the function as Rf_dropTrailing0(),
and tested the result with 'make check-all' .
As the change seems reasonable and consequent, and as
it seems not to produce any problems in our tests, 
I'm hereby proposing to commit it (my version of it),
[to R-devel only] within a few days,
unless someone speaks up.

Martin Maechler, ETH Zurich

   PS> --- R-devel/src/main/coerce.c	2009-04-17 17:53:35.000000000 +0200
    PS> +++ R-devel-elim-trailing/src/main/coerce.c	2009-05-23 08:39:03.914774176 +0200
    PS> @@ -294,12 +294,33 @@
    PS> else return mkChar(EncodeInteger(x, w));
    PS> }
 
    PS> +const char *elim_trailing(const char *s, char cdec)
    PS> +{
    PS> +    const char *p;
    PS> +    char *replace;
    PS> +    for (p = s; *p; p++) {
    PS> +        if (*p == cdec) {
    PS> +            replace = (char *) p++;
    PS> +            while ('0' <= *p & *p <= '9') {
    PS> +                if (*(p++) != '0') {
    PS> +                    replace = (char *) p;
    PS> +                }
    PS> +            }
    PS> +            while (*(replace++) = *(p++)) {
    PS> +                ;
    PS> +            }
    PS> +            break;
    PS> +        }
    PS> +    }
    PS> +    return s;
    PS> +}
    PS> +
    PS> SEXP attribute_hidden StringFromReal(double x, int *warn)
    PS> {
    PS> int w, d, e;
    PS> formatReal(&x, 1, &w, &d, &e, 0);
    PS> if (ISNA(x)) return NA_STRING;
    PS> -    else return mkChar(EncodeReal(x, w, d, e, OutDec));
    PS> +    else return mkChar(elim_trailing(EncodeReal(x, w, d, e, OutDec), OutDec));
    PS> }
 
    PS> SEXP attribute_hidden StringFromComplex(Rcomplex x, int *warn)



More information about the R-devel mailing list