[Rd] type.convert and doubles

Martin Maechler maechler at stat.math.ethz.ch
Tue Apr 22 09:42:11 CEST 2014


>>>>> McGehee, Robert <Robert.McGehee at geodecapital.com>
>>>>>     on Mon, 21 Apr 2014 09:24:13 -0400 writes:

    > Agreed. Perhaps even a global option would make sense. We
    > already have an option with a similar spirit:
    > 'options("stringsAsFactors"=T/F)'. Perhaps
    > 'options("exactNumericAsString"=T/F)' [or something else]
    > would be desirable, with the option being the default
    > value to the type.convert argument.

No, please, no, not a global option here!

Global options that influence the default behavior of basic
functions go too much against the principle of functional
programming, and my personal opinion has always been that
'stringsAsFactors' has been a mistake (as a global option, not
as an argument).

Note that with such global options, the output of sessionInfo()
would in principle have to contain all (such) global options in
addition to R and package versions in order to diagnose behavior
of R functions.
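
To make the concern concrete, here is a minimal sketch. Both
functions are hypothetical (they are not part of base R), and the
option name 'exactNumericAsString' is simply the one floated
above, not an existing option.

```r
## Hypothetical helper whose result depends on a global option
## ('exactNumericAsString' does not exist in base R).
conv_global <- function(x) {
  if (isTRUE(getOption("exactNumericAsString"))) x else as.numeric(x)
}

## The same helper with the choice exposed as an explicit argument.
conv_arg <- function(x, exact = FALSE) {
  if (exact) x else as.numeric(x)
}

x <- "72057594037927937"

## Identical calls, different classes, depending on hidden session state:
old <- options(exactNumericAsString = TRUE)
cls_on <- class(conv_global(x))
options(exactNumericAsString = FALSE)
cls_off <- class(conv_global(x))
options(old)  # restore the previous setting

## With an argument, the call itself documents the behavior:
cls_arg <- class(conv_arg(x, exact = TRUE))
```

Nothing in the two conv_global(x) calls reveals why their results
differ; one would have to know the state of options() to tell.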

I think we have more or less agreed that we'd like to have
a new function *argument* to type.convert(), which is also
passed "upstream" to read.table() and, via '...', to the other
read.<foo>() functions that call read.table().
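
As a sketch of what such an argument looks like in use: R 3.1.1
and later provide one under the name 'numerals' (that name comes
from those later releases, not from this thread). Assuming such
a version of R:

```r
## Assumes R >= 3.1.1, where type.convert() has a 'numerals' argument.
x <- c("72057594037927936", "72057594037927937")

## Accept the loss of accuracy: both strings map to the same double.
lossy <- type.convert(x, as.is = TRUE, numerals = "allow.loss")

## Refuse the lossy conversion: the input stays character (as.is = TRUE).
exact <- type.convert(x, as.is = TRUE, numerals = "no.loss")

class(lossy); length(unique(lossy))
class(exact); length(unique(exact))
```

The choice is visible at the call site, so the returned class
does not depend on session state.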


    > I also like Gabor's idea of a "distinguishing class". R
    > doesn't natively support arbitrary precision numbers
    > (AFAIK), but I think that's what Murray wants. I could
    > imagine some kind of new class emerging here that
    > initially looks just like a character/factor, but may
    > evolve over time to accept arithmetic methods and act more
    > like a number (e.g. knowing that "0.1", ".10" and "1e-1"
    > are the same number, or that "-9"<"-0.2"). A class
    > "bignum" perhaps?

That's another interesting idea. As maintainer of CRAN package
'Rmpfr' and co-maintainer of 'gmp', I'm even biased about this
issue.
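
As a rough illustration of what arbitrary-precision arithmetic
buys here, the following sketch assumes the CRAN package 'gmp'
is installed and uses Murray's two values from further down. It
covers integers only and is not a proposal for the "bignum"
class itself.

```r
## Assumes the CRAN package 'gmp' is installed.
library(gmp)

a <- as.bigz("72057594037927936")
b <- as.bigz("72057594037927937")

## As doubles, the two values cannot be told apart ...
same_as_double <-
  as.numeric("72057594037927936") == as.numeric("72057594037927937")

## ... while as big integers they stay distinct and support arithmetic.
same_as_bigz <- a == b
diff_chr <- as.character(b - a)
```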

Martin

    > Cheers, Robert


    > On 4/20/14, 3:24 AM, "Murray Stokely" <murray at stokely.org>
    > wrote:

    >> Yes, I'm also strongly in favor of having an option for
    >> this.  If there was an option in base R for controlling
    >> this we would just use that and get rid of the separate
    >> RProtoBuf.int64AsString option we use in the RProtoBuf
    >> package on CRAN to control whether 64-bit int types from
    >> C++ are returned to R as numerics or character vectors.
    >> 
    >> I agree that reasonable people can disagree about the
    >> default, but I found my original bug report about this,
    >> so I will counter Robert's example with my favorite
    >> example of what was wrong with the previous behavior :
    >> 
    >> tmp <- data.frame(n=c("72057594037927936",
    >>                       "72057594037927937"), name=c("foo", "bar"))
    >> length(unique(tmp$n)) # 2
    >> write.csv(tmp, "/tmp/foo.csv", quote=FALSE, row.names=FALSE)
    >> data <- read.csv("/tmp/foo.csv")
    >> length(unique(data$n)) # 1
    >> 
    >> - Murray
    >> 
    >> 
    >> On Sat, Apr 19, 2014 at 10:06 AM, Simon Urbanek
    >> <simon.urbanek at r-project.org> wrote:
    >>> On Apr 19, 2014, at 9:00 AM, Martin Maechler
    >>> <maechler at stat.math.ethz.ch> wrote:
    >>> 
    >>>>>>>>> McGehee, Robert <Robert.McGehee at geodecapital.com>
    >>>>>>>>> on Thu, 17 Apr 2014 19:15:47 -0400 writes:
    >>>> 
    >>>>>> This is all application specific and sort of beyond
    >>>>>> the scope of type.convert(), which now behaves as it
    >>>>>> has been documented to behave.
    >>>> 
    >>>>> That's only a true statement because the documentation
    >>>>> was changed to reflect the new behavior! The new
    >>>>> feature in type.convert certainly does not behave
    >>>>> according to the documentation as of R 3.0.3. Here's a
    >>>>> snippet:
    >>>> 
    >>>>> The first type that can accept all the non-missing
    >>>>> values is chosen (numeric and complex return values
    >>>>> will represented approximately, of course).
    >>>> 
    >>>>> The key phrase is in parentheses, which reminds the
    >>>>> user to expect a possible loss of precision. That
    >>>>> important parenthetical was removed from the
    >>>>> documentation in R 3.1.0 (among other changes).
    >>>> 
    >>>>> Putting aside the fact that this introduces a large
    >>>>> amount of unnecessary work rewriting SQL / data import
    >>>>> code, SQL packages, my biggest conceptual problem is
    >>>>> that I can no longer rely on a particular function
    >>>>> call returning a particular class. In my example
    >>>>> querying stock prices, about 5% of prices came back as
    >>>>> factors and the remaining 95% as numeric, so we had
    >>>>> random errors popping in throughout the morning.
    >>>> 
    >>>>> Here's a short example showing us how the new behavior
    >>>>> can be unreliable. I pass a character representation
    >>>>> of a uniformly distributed random variable to
    >>>>> type.convert. 90% of the time it is converted to
    >>>>> "numeric" and 10% it is a "factor" (in R 3.1.0). In
    >>>>> the 10% of cases in which type.convert converts to a
    >>>>> factor the leading non-zero digit is always a 9. So if
    >>>>> you were expecting a numeric value, then 1 in 10 times
    >>>>> you may have a bug in your code that didn't exist
    >>>>> before.
    >>>> 
    >>>>> > options(digits=16)
    >>>>> > cl <- NULL; for (i in 1:10000) cl[i] <-
    >>>>> +     class(type.convert(format(runif(1))))
    >>>>> > table(cl)
    >>>>> cl
    >>>>>  factor numeric
    >>>>>     990    9010
    >>>> 
    >>>> Yes.
    >>>> 
    >>>> Murray's point is valid, too.
    >>>> 
    >>>> But in my view, with the reasoning we have seen here,
    >>>> *and* with the well known software design principle of
    >>>> "least surprise" in mind, I also do think that the
    >>>> default for type.convert() should be what it has been
    >>>> for > 10 years now.
    >>>> 
    >>> 
    >>> I think there should be two separate discussions:
    >>> 
    >>> a) have an option (argument to type.convert and possibly
    >>> read.table) to enable/disable this behavior. I'm
    >>> strongly in favor of this.
    >>> 
    >>> b) decide what the default for a) will be. I have no
    >>> strong opinion, I can see arguments in both directions
    >>> 
    >>> But most importantly I think a) is better than the
    >>> status quo - even if the discussion about b) drags out.
    >>> 
    >>> Cheers, Simon
    >>> 
    >>> 
    >>>


