[Rd] suggestion for extending ?as.factor

Petr Savicky savicky at cs.cas.cz
Tue May 5 09:48:20 CEST 2009


On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote:
> Petr Savicky wrote:
> > For this, we get
> > 
> >   > convert(0.3)
> >   [1] "0.3"
> >   > convert(1/3)
> >   [1] "0.3333333333333333" # 16 digits suffice
> >   > convert(0.12345)
> >   [1] "0.12345"
> >   > convert(0.12345678901234567)
> >   [1] "0.12345678901234566"
> >   > 0.12345678901234567 == as.numeric("0.12345678901234566")
> >   [1] TRUE
> > 
> > This algorithm is slower than a single call to sprintf("%.17g", x), but it
> > produces nicer numbers, if possible, and guarantees that the value is
> > always preserved.
> 
> Yes, but....
> 
> > convert(0.2+0.1)
> [1] "0.30000000000000004"

I am not sure whether this should be considered an error. Computing with decimal
numbers requires some care. If we want to get the result that is closest
to 0.3, it is better to use round(0.1+0.2, digits=1).
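
As a quick check in plain R (no extra packages; printing with more digits
makes the binary FP value visible):

```r
0.1 + 0.2 == 0.3                     # FALSE: the binary FP sum differs from 0.3
print(0.1 + 0.2, digits = 17)        # shows 0.30000000000000004
round(0.1 + 0.2, digits = 1) == 0.3  # TRUE: rounding recovers the decimal value
```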
 
> I think that the real issue is that we actually do want almost-equal
> numbers to be folded together.

Yes, this is frequently needed. A related question of approximate matching
in a numeric sequence was discussed in the R-devel thread "Match .3 in a sequence"
from March. In order to fold almost-equal numbers together, we need to specify
a tolerance. The tolerance depends on the application, and in my opinion it is
hard to choose one tolerance that would be a good default.
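
For illustration, one simple way to fold values with an explicit,
application-chosen tolerance is to snap them to a grid of width tol
(the helper fold() below is hypothetical, not an existing R function):

```r
# Hypothetical helper: snap each value to the nearest multiple of tol.
fold <- function(x, tol) tol * round(x / tol)

x <- c(0.3, 0.1 + 0.2, 0.30000001)
length(unique(x))                # 3: the values differ in the last bits
length(unique(fold(x, 1e-6)))    # 1: all fold to the same representative
```

With a different tol, the same values may or may not be folded, which is
exactly why a universal default is hard to pick.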

> The most relevant case I can conjure up
> is this (permutation testing):
> 
> > zz <- replicate(20000,sum(sample(sleep$extra,10)))
> > length(table(zz))
> [1] 427
> > length(table(signif(zz,7)))
> [1] 281

In order to obtain the correct result in this example, it is possible to use
  zz <- signif(zz,7)
as you suggest, or
  zz <- round(zz, digits=1)
and then use the resulting zz for all further calculations.
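
Since sleep$extra holds values with one decimal place, every sum of ten of
them is a multiple of 0.1 in exact arithmetic, so rounding to one digit folds
the FP-perturbed duplicates (a sketch of the example above; the set.seed call
is added here only for reproducibility):

```r
set.seed(1)  # not in the original example; added for reproducibility
zz <- replicate(20000, sum(sample(sleep$extra, 10)))
length(table(zz))                     # inflated by FP noise
length(table(round(zz, digits = 1)))  # distinct sums actually observed
```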

> Notice that the discrepancy comes from sums that really are identical
> values (in decimal arithmetic), but where the binary FP inaccuracy makes
> them slightly different.
> 
> [for a nice picture, continue the example with
> 
> > tt <- table(signif(zz,7))
> > plot(as.numeric(names(tt)),tt, type="h")

The form of this picture is not due to rounding errors. The same picture may be
obtained even in integer arithmetic, as follows.

  ss <- round(10*sleep$extra)
  zz <- replicate(20000,sum(sample(ss,10)))
  tt <- table(zz)
  plot(as.numeric(names(tt)),tt, type="h")

The variation of the frequencies is due to two effects.

First, each individual value of the sum occurs with low probability, so 20000
replications do not suffice to get low variance of the individual frequencies. Using
1000 repetitions of the code above, I obtained estimates of some of the probabilities.
The most frequent sums have probability approximately p=0.0089 in a single sample.
With n=20000 replications, this gives mean frequency p*n = 178 and standard deviation
sqrt(p*(1-p)*n) = 13.28216.
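
These figures follow directly from the binomial distribution of a single
frequency count:

```r
p <- 0.0089            # estimated probability of the most frequent sums
n <- 20000             # number of replications
p * n                  # 178, the mean frequency
sqrt(p * (1 - p) * n)  # 13.28216, the standard deviation
```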

The other cause of the variation of the frequencies is that even the true distribution
of the sums has many local minima and maxima. The mean over 1000 repetitions of the above
table, restricted to sums in the interval 140:168, produced the estimates

  value mean frequency (over 1000 tables)
  140   172.411
  141   172.090
  142   174.297
  143   166.039
  144   159.260
  145   163.891
  146   162.317
  147   165.460
  148   177.870
  149   177.971
  150   177.754
  151   178.525 local maximum
  152   169.851
  153   164.443 local minimum
  154   168.488 the mean value of the sum
  155   164.816 local minimum
  156   169.297
  157   179.248 local maximum
  158   177.799
  159   176.743
  160   177.777
  161   164.173
  162   162.585
  163   164.641
  164   159.913
  165   165.932
  166   173.014
  167   172.276
  168   171.612
The local minima and maxima are visible here. The mean value 154 is approximately
the center of the histogram.
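
The true distribution need not be estimated by simulation here: with only 20
data points there are choose(20,10) = 184756 subsets of size 10, so it can be
enumerated exactly (combn is in the standard utils package). A sketch:

```r
ss <- round(10 * sleep$extra)        # integer version of the data
sums <- combn(ss, 10, sum)           # sum over every 10-element subset
tt <- table(sums) / choose(20, 10)   # exact probabilities of each sum
plot(as.numeric(names(tt)), tt, type = "h")
```

The mean of sums comes out as exactly 154, in agreement with the table above.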

Petr.


