[Rd] suggestion for extending ?as.factor
Michael Dewey
info at aghmed.fsnet.co.uk
Sat May 9 15:54:40 CEST 2009
At 14:18 08/05/2009, Martin Maechler wrote:
> >>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
> >>>>> on Fri, 8 May 2009 11:01:55 +0200 writes:
Somewhere below Martin asks for alternatives from list readers. I do
not have alternatives, but I do have two comments, one immediately
below this, the other embedded in-line.
This whole thread reminds me just why I have spent the best part of a
decade climbing the virtual Matterhorn called 'Learning R' and why it
is such a pleasure to use. It is the fact that somebody, somewhere
cares enough about consistency, usability and accuracy to devote
hours to getting even obscure details just right.
> PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
> PD> I think that the real issue is that we actually do want almost-equal
> PD> numbers to be folded together.
> >>
> >> yes, this now (revision 48469) will happen by default, using signif(x, 15)
> >> where '15' is the default for the new optional argument 'digitsLabels'
> >> {better argument name? (but must not start with 'label')}
>
> PS> Let me analyze the current behavior of factor(x) for numeric x with
> PS> missing(levels) and missing(labels). In this situation, levels are
> PS> computed as sort(unique(x)) from possibly transformed x. Then, labels
> PS> are constructed by a conversion of the levels to strings.
>
> PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.
>
> PS> If keepUnique is FALSE (the default), then
> PS> - values x are transformed by signif(x, digitsLabels)
> PS> - labels are computed using as.character(levels)
> PS> - digitsLabels defaults to 15, but may be set to any integer value
>
> PS> If keepUnique is TRUE, then
> PS> - values x are preserved
> PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
> PS> - digitsLabels defaults to 17, but may be set to any integer value
>
>(in theory; in practice, I think I've suggested somewhere that
> it should be >= 17; but see below.)
>
>Your summary seems correct to me.
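
A minimal sketch of the two labelling paths summarised above, using only base R. The helper name label_sketch and its arguments keep_unique and digits are hypothetical stand-ins for the 'keepUnique' and 'digitsLabels' arguments under discussion; this is not the code in factor() itself.

```r
# Hypothetical helper: mimics the behaviour described in the summary,
# it is not the implementation used by factor().
label_sketch <- function(x, keep_unique = FALSE,
                         digits = if (keep_unique) 17L else 15L) {
  # Default path: fold almost-equal values together before taking levels.
  if (!keep_unique) x <- signif(x, digits)
  levels <- sort(unique(x))
  labels <- if (keep_unique) {
    sprintf("%.*g", digits, levels)   # precision supplied as an argument
  } else {
    as.character(levels)
  }
  list(levels = levels, labels = labels)
}

x <- c(0.3, 0.1 + 0.2)                # two distinct doubles
folded <- label_sketch(x)             # values merged by signif(x, 15)
kept   <- label_sketch(x, keep_unique = TRUE)
```

Here folded has a single level, while kept has two levels whose 17-digit labels differ.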
>
> PS> There are several situations, when this approach produces duplicated levels.
> PS> Besides the one described in my previous email, there are also others
> PS> factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)
>
>yes, but this does not make much sense; I have already contemplated
>producing a warning in such cases, something like
>
>   if(keepUnique && digitsLabels < 17)
>       warning(gettextf(
>           "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
>           digitsLabels))
>
>
> PS> factor(1 + 0:5 * 1e-16, digitsLabels=17)
>
>again, this does not make much sense; but why disallow the useR
>to shoot himself in the foot?
I agree. As a useR I do not want to be stopped from doing anything. I
would appreciate a warning just before I shoot myself in the foot and
I definitely want one if it looks like I am going to aim for my head.
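
To make the kind of warning I have in mind concrete, here is a rough sketch in base R. The function labels_with_warning is purely hypothetical (it is not part of R and not what factor() does); it formats the distinct values with a given number of significant digits and warns when two of them end up with the same label.

```r
# Hypothetical helper, not part of R: format the distinct values of x
# with 'digits' significant digits and warn if two labels collide.
labels_with_warning <- function(x, digits = 15L) {
  levels <- sort(unique(x))
  labels <- sprintf("%.*g", digits, levels)
  if (anyDuplicated(labels) > 0L) {
    warning(sprintf("%d significant digits give duplicated labels", digits))
  }
  labels
}

x <- c(0.3, 0.1 + 0.2)                     # distinct doubles
ok <- labels_with_warning(x, 17L)          # two different labels, no warning
# With 15 digits both values are formatted identically; capture the warning.
msg <- tryCatch(labels_with_warning(x, 15L),
                warning = function(w) conditionMessage(w))
```

A warning rather than an error leaves the useR free to carry on, which is the behaviour I would prefer.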
> PS> I would like to suggest a modification. It eliminates most of the cases
> PS> where we get duplicated levels. It would eliminate all such cases if the
> PS> function signif() worked as expected. Unfortunately, with signif() working
> PS> as it does in the current versions of R, we still get duplicated levels.
>
> PS> The suggested modification is as follows.
>
> PS> If keepUnique is FALSE (the default), then
> PS> - values x are transformed by signif(x, digitsLabels)
> PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
> PS> - digitsLabels defaults to 15, but may be set to any integer value
>
>I tend to like this change, given -- as you found yesterday -- that
>as.character() is not even preserving 15 digits.
>OTOH, as.character() has been in use for a very long history of
>S (and R), whereas using sprintf() is not back compatible with
>it and actually depends on the LIBC implementation of the system-sprintf.
>For that reason as.character() would be preferable.
>Hmm....
>
> PS> If keepUnique is TRUE, then
> PS> - values x are preserved
> PS> - labels are computed using sprintf("%.*g", 17, levels)
> PS> - digitsLabels is ignored
>
>I had originally planned to do exactly the above.
>However, e.g., digitsLabels = 18 may be desired in some cases,
>and that's why I also left the possibility to apply it in the
>keepUnique case.
>
>
> PS> Arguments for the modification are the following.
>
> PS> 1. If keepUnique is FALSE, then computing labels using as.character() leads
> PS> to duplicated labels as demonstrated in my previous email. So, I suggest to
> PS> use sprintf("%.*g", digitsLabels, levels) instead of as.character().
>
>{as said above, that seems sensible, though unfortunately quite
> a bit less back-compatible!}
>
> PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
> PS> duplicated labels. So, I suggest to force digitsLabels=17, if keepUnique=TRUE.
>
> PS> If signif(,digitsLabels) works as expected, then the above approach
> PS> should not produce duplicated labels. Unfortunately, this is not the case.
> PS> There are numbers, which remain different in signif(x, 16), but are mapped
> PS> to the same string in sprintf("%.*g", 16, x). Examples of this kind may be
> PS> found using the script
>
> PS> for (i in 1:50) {
> PS>     x <- 10^runif(1, 38, 50)
> PS>     y <- x * (1 + 0:500 * 1e-16)
> PS>     y <- unique(signif(y, 16))
> PS>     z <- unique(sprintf("%.16g", y))
> PS>     stopifnot(length(y) == length(z))
> PS> }
>
> PS> This script is tested on Intel default arithmetic and on Intel with SSE.
>
> PS> Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE.
>
> PS> Unfortunately, a similar problem occurs even for digitsLabels = 15,
> PS> although for much larger numbers.
>
> PS> for (i in 1:200) {
> PS>     x <- 10^runif(1, 250, 300)
> PS>     y <- x * (1 + 0:500 * 1e-16)
> PS>     y <- unique(signif(y, 15))
> PS>     z <- unique(sprintf("%.15g", y))
> PS>     stopifnot(length(y) == length(z))
> PS> }
>
> PS> This script finds collisions, if SSE is enabled, on two
> PS> Intel computers, where I did the test. Without SSE, it
> PS> finds collisions only on one of them. Maybe it also
> PS> depends on the compiler, which is different.
>
>probably rather on the exact implementation of the underlying C
>library ("LIBC").
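
The two scripts above stop at the first collision. A variant that counts collisions instead may be handier for comparing machines; the sketch below assumes nothing beyond base R, and the function name count_collisions and its arguments are hypothetical. As discussed above, the counts are expected to depend on the platform and C library, so no particular result should be read into them.

```r
# Hypothetical helper: counts trials in which distinct signif() values
# collapse to the same string when formatted with 'digits' digits.
count_collisions <- function(digits, lo, hi, trials = 50L) {
  fmt <- paste0("%.", digits, "g")
  n <- 0L
  for (i in seq_len(trials)) {
    x <- 10^runif(1, lo, hi)
    y <- unique(signif(x * (1 + 0:500 * 1e-16), digits))
    z <- unique(sprintf(fmt, y))
    if (length(z) < length(y)) n <- n + 1L   # some labels coincide
  }
  n
}

set.seed(1)   # only so that repeated runs on one machine agree
n16 <- count_collisions(16L, 38, 50)
n15 <- count_collisions(15L, 250, 300, trials = 200L)
```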
>
>Thank you, Petr, for your investigations.
>We all see that the simple requirement of
> *no more duplicate factor levels !*
>leads to considerable programming efforts for the case of
>factor(<numeric>, .).
>
>One prominent R-devel reader actually proposed to me in private,
>that factor(<numeric>, .) should give a *warning* by default,
>since he considered it unsafe practice.
>
>Note that your last investigations show that your (two) proposed
>changes actually do *not* solve the problem entirely;
>further note that (at least inside the sources), we now say that
>duplicate levels will not just signal a warning, but an error in
>the future.
>As long as we don't want to allow factor(<numeric>) to fail --rarely --
>I think (and that actually has been a recurring daunting thought
>for quite a few days) that we probably need an
>extra step of checking for duplicate levels, and if we find
>some, recode "everything". This will blow up the body of the
>factor() function even more.
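
One possible shape for such an extra step, as a sketch only: compute the labels, and if some coincide, merge the corresponding levels and recode the values against the merged labels. The helper factor_recode below is hypothetical and uses sprintf() with a fixed number of digits; it is not the code in R's factor().

```r
# Hypothetical helper, not the code in factor(): build a factor from
# numeric x and merge any levels whose labels coincide.
factor_recode <- function(x, digits = 15L) {
  levels <- sort(unique(x))
  labels <- sprintf("%.*g", digits, levels)
  ulabels <- unique(labels)            # keeps the sorted order of levels
  # Map each value to its level, then each level to its merged label.
  codes <- match(labels[match(x, levels)], ulabels)
  factor(ulabels[codes], levels = ulabels)
}

f <- factor_recode(c(0.3, 0.1 + 0.2, 0.5))
```

In this example the two values near 0.3 share one level, so f has two levels and no duplicated labels.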
>
>What alternatives do you (all R-devel readers!) see?
>
>Martin
>
>______________________________________________
>R-devel at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-devel
Michael Dewey
http://www.aghmed.fsnet.co.uk