[Rd] type.convert and doubles

Tue Apr 29 10:58:14 CEST 2014

>>>>> peter dalgaard <pdalgd at gmail.com>
>>>>>     on Tue, 29 Apr 2014 09:32:21 +0200 writes:

    > On 28 Apr 2014, at 19:17 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
    >> 
    > [...snip...]

    >>>> I think there should be two separate discussions:
    >> 
    >>>> a) have an option (argument to type.convert and possibly
    >>>> read.table) to enable/disable this behavior. I'm strongly
    >>>> in favor of this.
    >> 
    >>> In my (not committed) version of R-devel, I now have
    >> 
    >>>> str(type.convert(format(1/3, digits=17), exact=TRUE))
    >>> Factor w/ 1 level "0.33333333333333331": 1
    >>>> str(type.convert(format(1/3, digits=17), exact=FALSE))
    >>> num 0.333
    >> 
    >>> where the 'exact' argument name has been ``imported'' from
    >>> the underlying C code.
    >> 
    >>> [ As we CRAN package writers know by now, arguments
    >>> nowadays can hardly be abbreviated anymore, so I am
    >>> not open to longer alternative argument names. As someone
    >>> who likes touch typing, I'm not fond of camel case or other
    >>> keyboard gymnastics (;-) but if someone has a great idea
    >>> for a better argument name.... ]
    >> 
    >>> Instead of only TRUE/FALSE, we could consider NA with
    >>> semantics "FALSE + warning" or also "TRUE + warning".
    >> 
    >> 
    >>>> b) decide what the default for a) will be. I have no
    >>>> strong opinion, I can see arguments in both directions
    >> 
    >>> I think many have seen the good arguments in both
    >>> directions.  I'm still strongly advocating that we value
    >>> long term stability higher here, and revert to more
    >>> compatibility with the many years of previous versions.
    >> 
    >>> If we'd use a default of 'exact=NA', I'd like it to mean
    >>> FALSE + warning, but would not strongly oppose TRUE +
    >>> warning.
    >> 
    >> I have now committed svn rev 65507  --- to R-devel only for now ---
    >> the above:   exact = NA  is the default
    >> and it means  "warning + FALSE".
    >> 
    >> Interestingly, I currently get 5 identical warnings for one
    >> simple call, so there is clearly room for optimization, and
    >> that is one main reason for this change not yet being migrated
    >> to 'R 3.1.0 patched'.

    > I actually think that the default should be the old behaviour. No warning, just potentially lose digits. If this gets a user in trouble, _then_ turn on the check for lost digits. 

    > After all, I think we had about one single use case, where lost digits caused trouble (I cannot even dig up what the case was - someone had, like, 20-digit ID labels, I reckon). In contrast, we have seen umpteen cases where people have exported floating point data to slightly beyond machine precision, "just in case", and relied on read.table() to do the sensible thing.

    > It's also an open question whether we really want to apply the same logic to doubles and integer inputs. 

A really good point.  From my cursory reading of the code, it is
not obvious where to make the distinction without quite a bit
more coding, but I may just have overlooked a good idea.
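
To make the integer vs. double distinction concrete, here is a minimal R-level sketch (not the C code in question). Both helper names are hypothetical and not part of base R, and the round-trip check assumes integer strings without leading zeros or a "+" sign.

```r
# Hypothetical helpers for illustration; neither is part of base R.

# TRUE for strings that look like plain decimal integers.
looks_integer <- function(x) grepl("^-?[0-9]+$", x)

# TRUE where an integer-like string does not survive a round trip
# through double (assumes no leading zeros and no "+" sign).
integer_loses_digits <- function(x) {
  stopifnot(all(looks_integer(x)))
  sprintf("%.0f", as.numeric(x)) != x
}

ids <- c("123", "9007199254740993", "12345678901234567890")
looks_integer(c(ids, "0.12345678901234567"))  # TRUE TRUE TRUE FALSE
integer_loses_digits(ids)                     # FALSE TRUE TRUE
```

For long integer IDs the lost digits change the value's identity, whereas for a string such as "0.12345678901234567" the extra digits are usually just export noise, which is why one might want to treat the two cases differently.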

    > The whole change went in as (r62327)

    > "force type.convert to read e.g. 64-bit integers as strings/factors"

    > I, for one, did not expect that "e.g." would include 0.12345678901234567. My eyes were on the upcoming 3.0.0 release at that point, so I might not have noticed it anyway, but apparently no one lifted an eyebrow. It seems that this was deliberately postponed for 3.1.0, but for more than a year, no one actually exercised the code. 

    > -pd

    > BTW, "exact" is a horrible name for an option, how about digitloss=c("allow", "warn", "forbid")?

I've also thought quickly about switching to an "enumeration
type" with string options.
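
As a rough illustration of such a string-valued option, here is a minimal sketch using match.arg(). The function name 'convert_numeric' is hypothetical, 'digitloss' and its three values are taken from the suggestion quoted above, and the loss criterion (more than 15 significant decimal digits) is an assumption for the example, not what the C code does.

```r
# Hypothetical sketch; 'convert_numeric' is an illustrative name,
# not an existing R function.
convert_numeric <- function(x, digitloss = c("allow", "warn", "forbid")) {
  digitloss <- match.arg(digitloss)
  # Assumed loss criterion: more than 15 significant decimal digits.
  mantissa <- sub("[eE].*$", "", x)        # drop any exponent
  digits <- gsub("[^0-9]", "", mantissa)   # drop sign and decimal point
  digits <- sub("^0+", "", digits)         # drop leading zeros
  lossy <- any(nchar(digits) > 15)
  if (lossy && digitloss == "forbid") return(factor(x))
  if (lossy && digitloss == "warn")
    warning("conversion to double may lose digits")
  as.numeric(x)
}

str(convert_numeric("0.12345678901234567"))                       # numeric
str(convert_numeric("0.12345678901234567", digitloss = "forbid")) # factor
```

With match.arg(), the first value ("allow") is the default and a misspelled option fails early with an error listing the valid choices.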

If we distinguished integer and non-integer input (and
hexadecimal vs decimal, which are already different code branches),
we would need more than three options anyway ...
and when I start thinking about the possibilities, I start to
see too many "desirable" possibilities, e.g.,

 digitloss="allow for non-integers, don't warn"
 digitloss="allow for non-integers, do warn"
 digitloss="forbid, don't warn"
 digitloss="forbid, do  warn"

etc... which would speak for a different approach, maybe with
yet another argument for dealing with "long integer" only.

OTOH, I don't feel like spending considerably more time on
this now, unless others are willing to help as well (coding + testing).

Martin