[R] strangely long floating point with write.table()

Sun Mar 16 07:13:04 CET 2014

On Sat, 15 Mar 2014, peter dalgaard wrote:

> On 15 Mar 2014, at 20:54 , Mike Miller <mbmiller+l at gmail.com> wrote:
>
>> $ cat data1.txt
>> 0.005
>> 0.00499999999999989
>>
>> I don't know why it shows 17 digits and doesn't round to 15, but it is showing that the numbers are different, for some reason.
>>
>
> Aiding my weakening eyesight a little:
>
> 0.004 999 999 999 999 89
>
> Notice that that makes 15 _significant_ digits.

OK, now I feel really stupid.  Of course it's 15 significant (mantissa) 
digits, not 15 digits after the decimal point as with %f.  Sorry about that.

>> Do you understand why there is a difference between 1-0.995 and 2-1.995 
>> in their internal representations?
>
> Let's see,  that'll be like
>
> 1 - 2/3 vs. 10 - 29/3
>
> on a decimal computer if someone is perverse enough to give input in 
> base 3 (i.e., 1.0 - 0.2 ternary vs. 101.0 - 100.2 ternary). Assume that 
> the computer is floating point with 3 significant digits (and possibly 
> taking some liberties compared to what real computers really do), we 
> have
>
>   1 = 1.000 * 10^0
>  10 = 1.000 * 10^1
> 2/3 = 0.667 * 10^0
> 29/3 = 0.967 * 10^1
>
> 1 - 2/3  = 0.333 * 10^0
> 10 - 29/3 = 0.033 * 10^1 = 0.330 * 10^0
>
> So, yes, I think I do understand how these things can happen.
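
For what it's worth, Peter's three-digit decimal machine can be mimicked 
in R with signif().  This is only a sketch: round3 is a hypothetical helper 
name of mine, and real binary floating point rounds in base 2, not base 10.

```r
# Hypothetical helper: keep 3 significant decimal digits after every step.
round3 <- function(x) signif(x, 3)

small <- round3(1 - round3(2 / 3))     # 1.00 - 0.667
large <- round3(10 - round3(29 / 3))   # 10.0 - 9.67

print(small)   # three correct digits of 1/3 survive
print(large)   # one digit fewer survives the cancellation
```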

Yes, and that's a nice explanation, but you had me at "_significant_".  I 
don't know why I didn't get that in the first place.  So the difference in 
my example is that 0.995 is 9.950e-1, so the 5 is the third significant 
digit, whereas in 1.995 the 5 is the fourth significant digit.  The stored 
approximation of 1.995 can therefore carry a larger absolute error than 
that of 0.995, and 1-0.995 provides a more precise representation of 0.005 
than does 2-1.995.
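
A quick way to see this at the R prompt is to print both differences with 
more digits than the default.  This is only a sketch, and the variable 
names a and b are mine:

```r
a <- 1 - 0.995
b <- 2 - 1.995

# The two results are different doubles, even though both "should" be 0.005.
print(a == b)

# 17 significant digits are enough to tell any two distinct doubles apart.
print(sprintf("%.17g", c(a, b)))

# 15 significant digits, matching what showed up in data1.txt above.
print(format(a, digits = 15))
print(format(b, digits = 15))
```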

I always knew there was some numerical reason why I was getting very long 
stretches of 9s or 0s in the write.table() output, but my concern is 
really with how to prevent that from happening.  So the question still is, 
how do I avoid getting 0.00499999999999989 in my output file when I want 
0.005?  I'm sure I'm not alone in this.  It looks like the standard answer 
is to use format().  For example, I could do this:

> write.table(format(data, digits=13, trim=TRUE), file="data.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

That does fix the long numbers -- all of them are reduced to three digits. 
The one thing that concerns me is that when format() is called, isn't it 
making an object that could take up a lot of memory if the data frame is 
large?  The data frame created by format() might use a lot more memory 
than the original data frame if it is converting a lot of doubles (8 
bytes) to a lot of possibly 16-byte strings.  For example, -10/81 takes up 
8 bytes as a double, but converted by format with digits=13 it uses 16 
bytes to include the sign, the zero and the decimal point (plus a 
delimiter when there are many per line of output):

> write.table(format(-10/81, digits=13), row.names=FALSE, col.names=FALSE, quote=FALSE)
-0.1234567901235
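
To get a rough idea of the memory cost, one can compare object.size() 
before and after format().  This is only a sketch on simulated data (the 
vector x and its length are made up), and the exact numbers will vary by 
platform.  Note that R stores each distinct string with its own header, so 
the cost per value is typically well above 16 bytes.

```r
set.seed(1)
x <- runif(1e5)                  # 1e5 doubles, 8 bytes each plus a small header
chr <- format(x, digits = 13)    # the same values as character strings

num_bytes <- as.numeric(object.size(x))
chr_bytes <- as.numeric(object.size(chr))

print(num_bytes)
print(chr_bytes)
print(chr_bytes / num_bytes)     # how many times larger the character version is
```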

I'm assuming that write.table() is streaming the data into a file (or 
stdout) and not creating a complete representation of the output in memory 
before it does that.  It looks like format() creates a data frame where 
all variables are converted to character type.  Thus, it wouldn't be just 
for convenience that one might want digits=N to be an option in the 
write.table() function.  It would be very useful with large data frames, 
making it possible to write out things that would be too large to handle 
using format().  When files are already super-large, we really want to 
avoid expanding the number of digits per value in the output.
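
In the meantime, one possible workaround (a sketch, not a tested solution 
for really large data) is to round the numeric columns with signif() 
rather than convert them to strings, and to write the data frame in blocks 
of rows with append=TRUE so that only one block is copied at a time.  The 
function name write_table_chunked and its arguments are hypothetical, and 
I am assuming the data frame itself fits in memory:

```r
# Hypothetical helper: write a data frame in row blocks, rounding doubles
# to `digits` significant digits instead of converting them to strings.
write_table_chunked <- function(df, file, digits = 13, chunk_rows = 10000L) {
  n <- nrow(df)
  if (n == 0L) return(invisible(file))
  is_num <- vapply(df, is.double, logical(1))
  for (s in seq(1L, n, by = chunk_rows)) {
    rows <- s:min(s + chunk_rows - 1L, n)
    chunk <- df[rows, , drop = FALSE]
    if (any(is_num)) {
      # Rounded values are still doubles, so no character copy is made.
      chunk[is_num] <- lapply(chunk[is_num], signif, digits = digits)
    }
    # Create the file with the first block, append all later blocks.
    write.table(chunk, file = file, append = s > 1L,
                row.names = FALSE, col.names = FALSE, quote = FALSE)
  }
  invisible(file)
}

# Usage on a tiny example, with a deliberately small block size.
df <- data.frame(x = c(2 - 1.995, 1 - 0.995, -10/81))
f <- tempfile(fileext = ".txt")
write_table_chunked(df, f, chunk_rows = 2L)
out <- readLines(f)
print(out)
```

The rounding still makes a copy of each block, but the copy is numeric and 
only as large as the block, not a character version of the whole table.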

Mike