[Rd] Match .3 in a sequence

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Tue Mar 17 10:15:39 CET 2009


Petr Savicky wrote:
> On Mon, Mar 16, 2009 at 07:39:23PM -0400, Stavros Macrakis wrote:
> ...
>   
>> Let's look at the extraordinarily poor behavior I was mentioning. Consider:
>>
>> nums <- (.3 + 2e-16 * c(-2,-1,1,2)); nums
>> [1] 0.3 0.3 0.3 0.3
>>
>> Though they all print as .3 with the default precision (which is
>> normal and expected), they are all different from .3:
>>
>> nums - .3 =>  -3.885781e-16 -2.220446e-16  2.220446e-16  3.885781e-16
>>
>> When we convert nums to a factor, we get:
>>
>> fact <- as.factor(nums); fact
>> [1] 0.300000000000000 0.3               0.3               0.300000000000000
>> Levels: 0.300000000000000 0.3 0.3 0.300000000000000
>>
>> Not clear what the difference between 0.300000000000000 and 0.3 is
>> supposed to be, nor why some 0.300000000000000 are < .3 and others are
>>     
> ...
>
> When creating a factor from numeric vector, the list of levels and the
> assignment of original elements to the levels is done using
> double precision. Since the four elements in the vector are distinct,
> we get four distinct levels. After this is done, the levels attribute is
> formed using as.character(). This can map different numbers to the same
> string, so in the example above, this leads to a factor, which contains
> repeated levels.
>
> This part of the problem may be avoided using
>
>   fact <- as.factor(as.character(nums)); fact
>   [1] 0.300000000000000 0.3               0.3               0.300000000000000
>   Levels: 0.3 0.300000000000000
>
> The reason for having 0.300000000000000 and 0.3 is that as.character()
> works the same as printing with digits=15. The R printing mechanism
> works in two steps. In the first step it tries to determine the shortest 
> format needed to achieve the required relative precision of the output.
> This step uses an algorithm, which need not provide an accurate result.
> The next step is that the number is printed using C function sprintf
> with the chosen format. This step is accurate, so we cannot get wrong
> digits. We only can get wrong number of digits.
>
> In order to avoid using 15 digits in as.character(), we can use round(,digits),
> with digits argument appropriate for the current situation.
>
>   > fact <- as.factor(round(nums,digits=1)); fact
>   [1] 0.3 0.3 0.3 0.3
>   Levels: 0.3
>
>   

with the examples above, it looks like a design flaw that factor levels
and their *labels* are lumped together.  if, in the above, the levels
were the numbers themselves, and their labels were produced with
as.character, as you show, but kept separately (or generated on the fly
when displaying the factor), the problem would be solved.  you
would then have something like:
  
    nums <- (.3 + 2e-16 * c(-2,-1,1,2)); nums   
    # [1] 0.3 0.3 0.3 0.3
   
    sum(nums[rep(1:4, each=4)] == nums[rep(1:4, 4)])
    # 4

    fact <- as.factor(nums); fact
    # [1] 0.300000000000000 0.3 0.3 0.300000000000000
    # Levels: 0.300000000000000 0.3 0.3 0.300000000000000
  
    sum(fact[rep(1:4, each=4)] == fact[rep(1:4, 4)])
    # 4 (currently, it's 8)
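
a minimal sketch of that idea, outside the factor class.  the names lev,
codes, labs and n_equal are made up for illustration and are not part of
R's factor implementation; the only assumption is that match() and
unique() compare doubles exactly:

```r
# the four doubles from the example; they print alike but are distinct
nums <- .3 + 2e-16 * c(-2, -1, 1, 2)

# hypothetical representation: numeric levels, integer codes, and
# labels that are derived for display only
lev   <- sort(unique(nums))   # levels stay numeric
codes <- match(nums, lev)     # position of each element among the levels
labs  <- as.character(lev)    # labels may collide; they are never compared

# equality is decided on the codes, so only the diagonal pairs match
n_equal <- sum(outer(codes, codes, "=="))
n_equal
# [1] 4
```

since the labels are only computed for printing, two levels with the same
label remain distinct in every comparison.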
   
there's one more curiosity about factors, in particular, ordered factors:

    ord <- as.ordered(nums); ord
    # [1] 0.300000000000000 0.3               0.3               0.300000000000000
    # Levels: 0.300000000000000 < 0.3 < 0.3 < 0.300000000000000

    ord[1] < ord[4]
    # TRUE
    ord[1] == ord[4]
    # TRUE
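
one plausible reading of this, sketched with plain vectors rather than a
real ordered factor: `<` is decided on the integer codes (positions among
the levels), while `==` is decided on the label strings.  the vectors
codes and labs below are hand-written stand-ins that copy the labels
shown above; they are an assumption for illustration, not output
extracted from R:

```r
# stand-ins for the internal codes and the (duplicated) level labels
codes <- 1:4
labs  <- c("0.300000000000000", "0.3", "0.3", "0.300000000000000")

# order comparison on positions: first level comes before the fourth
lt <- codes[1] < codes[4]

# equality comparison on labels: both elements carry the same string
eq <- labs[codes[1]] == labs[codes[4]]

c(lt, eq)
# [1] TRUE TRUE
```

with the two comparisons looking at different things, nothing prevents
both from being true for the same pair of elements.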

vQ


