[Rd] suggestion for extending ?as.factor
Petr Savicky
savicky at cs.cas.cz
Fri May 8 11:01:55 CEST 2009
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
> PD> I think that the real issue is that we actually do want almost-equal
> PD> numbers to be folded together.
>
> yes, this now (revision 48469) will happen by default, using signif(x, 15)
> where '15' is the default for the new optional argument 'digitsLabels'
> {better argument name? (but must not start with 'label')}
Let me analyze the current behavior of factor(x) for numeric x with missing(levels)
and missing(labels). In this situation, levels are computed as sort(unique(x))
from possibly transformed x. Then, labels are constructed by a conversion of the
levels to strings.
I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.
If keepUnique is FALSE (the default), then
- values x are transformed by signif(x, digitsLabels)
- labels are computed using as.character(levels)
- digitsLabels defaults to 15, but may be set to any integer value
If keepUnique is TRUE, then
- values x are preserved
- labels are computed using sprintf("%.*g", digitsLabels, levels)
- digitsLabels defaults to 17, but may be set to any integer value
There are several situations in which this approach produces duplicated levels.
Besides the one described in my previous email, there are also others:
  factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)
  factor(1 + 0:5 * 1e-16, digitsLabels=17)
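The mechanism behind these two examples can be reproduced with base R alone,
without the new arguments. The following is only a sketch under the description
given above: it formats the values directly and skips the signif() step of the
second example, so it illustrates the labeling, not the full factor() code path.

```r
# First example: two distinct doubles whose 15-digit labels coincide.
x <- c(0.3, 0.1 + 0.2)
length(unique(x))                 # number of distinct values kept as levels
lab15 <- sprintf("%.15g", x)      # labels as in the keepUnique=TRUE path, 15 digits
lab17 <- sprintf("%.17g", x)      # the same values formatted with 17 digits
anyDuplicated(lab15) > 0          # are there duplicated labels at 15 digits?
anyDuplicated(lab17) > 0          # and at 17 digits?

# Second example: several distinct doubles very close to 1.
# Assumption: signif(y, 17) leaves these values unchanged, so it is omitted.
y <- unique(1 + 0:5 * 1e-16)
length(y)                         # number of distinct doubles
as.character(y)                   # labels as in the keepUnique=FALSE path
sprintf("%.17g", y)               # 17-digit labels for comparison
```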
I would like to suggest a modification. It eliminates most of the cases where
we get duplicated levels. It would eliminate all such cases if the function
signif() worked as expected. Unfortunately, with signif() working as it does in
the current versions of R, we still get duplicated levels.
The suggested modification is as follows.
If keepUnique is FALSE (the default), then
- values x are transformed by signif(x, digitsLabels)
- labels are computed using sprintf("%.*g", digitsLabels, levels)
- digitsLabels defaults to 15, but may be set to any integer value
If keepUnique is TRUE, then
- values x are preserved
- labels are computed using sprintf("%.*g", 17, levels)
- digitsLabels is ignored
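To make the suggestion concrete, here is a minimal sketch of the two branches
in plain R. The function name sketch_levels() is hypothetical and is not part
of factor(); keepUnique and digitsLabels simply mirror the arguments discussed
in this thread. It only computes levels and labels and does not build a factor.

```r
# Hypothetical helper illustrating the suggested rules; not the R-devel code.
sketch_levels <- function(x, keepUnique = FALSE, digitsLabels = 15) {
  if (keepUnique) {
    # Values are preserved; digitsLabels is ignored and 17 digits are forced.
    lev <- sort(unique(x))
    lab <- sprintf("%.*g", 17, lev)
  } else {
    # Values are rounded first, then formatted with the same number of digits.
    lev <- sort(unique(signif(x, digitsLabels)))
    lab <- sprintf("%.*g", digitsLabels, lev)
  }
  list(levels = lev, labels = lab)
}

# Usage on the first example above.
sketch_levels(c(0.3, 0.1 + 0.2), keepUnique = TRUE)
sketch_levels(c(0.3, 0.1 + 0.2))
```

Using the same digitsLabels for rounding and for formatting is the point of
the first branch: the labels are then as fine as the rounding, provided that
signif() behaves as expected.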
Arguments for the modification are the following.
1. If keepUnique is FALSE, then computing labels using as.character() leads
to duplicated labels, as demonstrated in my previous email. So, I suggest
using sprintf("%.*g", digitsLabels, levels) instead of as.character().
2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
duplicated labels. So, I suggest forcing digitsLabels=17 if keepUnique=TRUE.
If signif(, digitsLabels) worked as expected, then the above approach should not
produce duplicated labels. Unfortunately, this is not the case.
There are numbers which remain different in signif(x, 16), but are mapped
to the same string in sprintf("%.*g", 16, x). Examples of this kind may be
found using the script:
  for (i in 1:50) {
      x <- 10^runif(1, 38, 50)
      y <- x * (1 + 0:500 * 1e-16)
      y <- unique(signif(y, 16))
      z <- unique(sprintf("%.16g", y))
      stopifnot(length(y) == length(z))
  }
This script was tested on Intel default arithmetic and on Intel with SSE.
Perhaps digitsLabels = 16 could be forbidden if keepUnique is FALSE.
Unfortunately, a similar problem occurs even for digitsLabels = 15, although for
much larger numbers:
  for (i in 1:200) {
      x <- 10^runif(1, 250, 300)
      y <- x * (1 + 0:500 * 1e-16)
      y <- unique(signif(y, 15))
      z <- unique(sprintf("%.15g", y))
      stopifnot(length(y) == length(z))
  }
With SSE enabled, this script finds collisions on both Intel computers on which
I ran the test. Without SSE, it finds collisions on only one of them. Maybe it
also depends on the compiler, which is different on the two machines.
Petr.