[Rd] Question re: NA, NaNs in R

Tue Feb 11 05:52:17 CET 2014

Hi Duncan,

Thanks a ton -- I appreciate your taking the time to investigate this,
and especially your checking the IEEE standard to clarify.

Cheers,
Kevin

On Mon, Feb 10, 2014 at 11:54 AM, Rainer M Krug <Rainer at krugs.de> wrote:
>
>
> On 02/10/14, 19:07 , Duncan Murdoch wrote:
>> On 10/02/2014 10:21 AM, Tim Hesterberg wrote:
>>> This isn't quite what you were asking, but might inform your choice.
>>>
>>> R doesn't try to maintain the distinction between NA and NaN when
>>> doing calculations, e.g.:
>>> > NA + NaN
>>> [1] NA
>>> > NaN + NA
>>> [1] NaN
>>> So for the aggregate package, I didn't attempt to treat them differently.
>>
>> This looks like a bug to me.  In 32 bit 3.0.2 and R-patched I see
>>
>>> NA + NaN
>> [1] NA
>>> NaN + NA
>> [1] NA
>
> But under 3.0.2 patched 64 bit on Mavericks:
>
>> version
>                _
> platform       x86_64-apple-darwin10.8.0
> arch           x86_64
> os             darwin10.8.0
> system         x86_64, darwin10.8.0
> status         Patched
> major          3
> minor          0.2
> year           2014
> month          01
> day            07
> svn rev        64692
> language       R
> version.string R version 3.0.2 Patched (2014-01-07 r64692)
> nickname       Frisbee Sailing
>> NA+NaN
> [1] NA
>> NaN+NA
> [1] NaN
>
>>
>> This seems more reasonable to me.  NA should propagate.  (I can see an
>> argument for NaN for the answer here, as I can't think of any possible
>> non-missing value that would give anything else when added to NaN, but
>> the answer should not depend on the order of operands.)
>>
>> However, I get the same as you in 64 bit 3.0.2.  All calculations I've
>> shown are on 64 bit Windows 7.
>>
>> Duncan Murdoch
>>
>>
>>>
>>> The aggregate package is available at
>>> http://www.timhesterberg.net/r-packages
>>>
>>> Here is the inst/doc/missingValues.txt file from that package:
>>>
>>> --------------------------------------------------
>>> Copyright 2012 Google Inc. All Rights Reserved.
>>> Author: Tim Hesterberg <rocket at google.com>
>>> Distributed under GPL 2 or later.
>>>
>>>
>>>     Handling of missing values and not-a-numbers.
>>>
>>>
>>> Here I'll note how this package handles missing values.
>>> I do it the way R handles them, rather than the more strict way that
>>> S+ does.
>>>
>>> First, for terminology,
>>>    NaN = "not-a-number", e.g. the result of 0/0
>>>    NA  = "missing value" or "true missing value", e.g. survey
>>> non-response
>>>    xx  = I'll use this for the union of those, or "missing value of
>>> any kind".
>>>
>>> For background, at the hardware level there is an IEEE standard that
>>> specifies that certain bit patterns are NaN, and specifies that
>>> operations involving an NaN result in another NaN.
>>>
>>> That standard doesn't say anything about missing values, which are
>>> important in statistics.
>>>
>>> So what R and S+ do is to pick one of the bit patterns and declare
>>> that to be an NA.  In other words, the NA bit pattern is one of
>>> the NaN bit patterns.
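
To make the bit-pattern point concrete, here is a minimal C sketch. It is not R's source: the helper names are made up for illustration, and the layout (exponent bits all ones, low 32-bit word equal to 1954) is assumed from the R_IsNA code quoted further down in this thread.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret raw bits as a double.
   memcpy sidesteps the aliasing issues of pointer casts. */
static double double_from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical helper: low 32 bits of a double's representation. */
static uint32_t low_word(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits & 0xffffffffu);
}

/* Assumed NA layout: high word 0x7ff00000, low word 1954. */
static double make_na_like(void)
{
    return double_from_bits(((uint64_t)0x7ff00000u << 32) | 1954u);
}

/* An NA-like value is a NaN that also carries 1954 in its low word. */
static int is_na_like(double x)
{
    return isnan(x) && low_word(x) == 1954u;
}
```

Under these assumptions make_na_like() is a NaN as far as isnan() is concerned, while an ordinary NaN such as the C macro NAN is not NA-like, because its low word is not 1954.
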
>>>
>>> At the user level, the reverse seems to hold.
>>> You can assign either NA or NaN to an object.
>>> But:
>>>     is.na(x) returns TRUE for both
>>>     is.nan(x) returns TRUE for NaN and FALSE for NA
>>> Based on that, you'd think that NaN is a subset of NA.
>>> To tell whether something is a true missing value do:
>>>     (is.na(x) & !is.nan(x))
>>>
>>> The S+ convention is that any operation involving NA results in an NA;
>>> otherwise any operation involving NaN results in NaN.
>>>
>>> The R convention is that any operation involving xx results in an xx;
>>> a missing value of any kind results in another missing value of any
>>> kind.  R considers NA and NaN equivalent for testing purposes:
>>>     all.equal(NA_real_, NaN)
>>> gives TRUE.
>>>
>>> Some R functions follow the S+ convention, e.g. the Math2 functions
>>> in src/main/arithmetic.c use this macro:
>>> #define if_NA_Math2_set(y,a,b)                \
>>>     if      (ISNA (a) || ISNA (b)) y = NA_REAL;    \
>>>     else if (ISNAN(a) || ISNAN(b)) y = R_NaN;
>>>
>>> Other R functions, like the basic arithmetic operations +-/*^,
>>> do not (search for PLUSOP in src/main/arithmetic.c).
>>> They just let the hardware do the calculations.
>>> As a result, you can get odd results like
>>> > is.nan(NA_real_ + NaN)
>>> [1] FALSE
>>> > is.nan(NaN + NA_real_)
>>> [1] TRUE
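
The order dependence can be probed outside R with a small C sketch. This is only an illustration with made-up helper names: it adds two NaNs that differ in payload and reports the low word of the sum. Which operand's payload survives is not specified by C; it depends on the hardware and on how the compiler orders the operands, so the sketch makes no claim about the value you will see.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret raw bits as a double. */
static double double_from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical helper: raw bits of a double. */
static uint64_t bits_from_double(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* A quiet NaN carrying the given value in its low word. */
static double nan_with_payload(uint32_t payload)
{
    return double_from_bits(0x7ff8000000000000ULL | payload);
}

/* Add through volatiles so the sum is computed at run time
   rather than folded by the compiler. */
static double add_at_runtime(double a, double b)
{
    volatile double x = a, y = b;
    return x + y;
}

/* Low word of a + b; the result is always a NaN here, but the
   payload it carries is platform dependent. */
static uint32_t sum_payload(double a, double b)
{
    return (uint32_t)(bits_from_double(add_at_runtime(a, b)) & 0xffffffffu);
}
```

Comparing sum_payload(nan_with_payload(1954), nan_with_payload(0)) with the same call with the arguments swapped shows whether a given platform keeps the payload of the first or the second operand, which is the effect behind the two is.nan() results above.
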
>>>
>>> The R help files help(is.na) and help(is.nan) suggest that
>>> computations involving NA and NaN are indeterminate.
>>>
>>> It is faster to use the R convention; most operations are just
>>> handled by the hardware, without extra work.
>>>
>>> In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
>>> and NaN are removed.
>>>
>>>
>>>
>>>
>>> >There is one NA but multiple NaNs.
>>> >
>>> >And please re-read 'man memcmp': your cast is wrong.
>>> >
>>> >On 10/02/2014 06:52, Kevin Ushey wrote:
>>> >> Hi R-devel,
>>> >>
>>> >> I have a question about the differentiation between NA and NaN values
>>> >> as implemented in R. In arithmetic.c, we have
>>> >>
>>> >> int R_IsNA(double x)
>>> >> {
>>> >>      if (isnan(x)) {
>>> >>          ieee_double y;
>>> >>          y.value = x;
>>> >>          return (y.word[lw] == 1954);
>>> >>      }
>>> >>      return 0;
>>> >> }
>>> >>
>>> >> ieee_double is just used for type punning so we can check the final
>>> >> bits and see if they're equal to 1954; if they are, x is NA, if
>>> >> they're not, x is NaN (as defined for R_IsNaN).
>>> >>
>>> >> My question is -- I can see a substantial increase in speed (on my
>>> >> computer, in certain cases) if I replace this check with
>>> >>
>>> >> int R_IsNA(double x)
>>> >> {
>>> >>      return memcmp(
>>> >>          (char*)(&x),
>>> >>          (char*)(&NA_REAL),
>>> >>          sizeof(double)
>>> >>      ) == 0;
>>> >> }
>>> >>
>>> >> IIUC, there is only one bit pattern used to encode R NA values, so
>>> >> this should be safe. But I would like to be sure:
>>> >>
>>> >> Is there any guarantee that the different functions in R would return
>>> >> NA as identical to the bit pattern defined for NA_REAL, for a given
>>> >> architecture? Similarly for NaN value(s) and R_NaN?
>>> >>
>>> >> My guess is that it is possible some functions used internally by R
>>> >> might encode NaN values differently; i.e., setting the lower word to a
>>> >> value different than 1954 (hence being NaN, but potentially not
>>> >> identical to R_NaN), or perhaps this is architecture-dependent.
>>> >> However, NA should be one specific bit pattern (?). And, I wonder if
>>> >> there is any guarantee that the different functions used in R would
>>> >> return an NaN value as identical to R_NaN (which appears to be the
>>> >> 'IEEE NaN')?
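
The difference between the two checks can be sketched in plain C. This is an illustration under assumptions, not R's code: the helper names are made up, the canonical NA pattern is assumed to be 0x7ff00000000007a2 (low word 1954), and the second pattern is the same value with the quiet bit set, which is what IEEE arithmetic is expected to produce when a signaling NaN passes through an operation.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define NA_BITS         0x7ff00000000007a2ULL /* assumed canonical NA pattern */
#define QUIETED_NA_BITS 0x7ff80000000007a2ULL /* same payload, quiet bit set  */

/* Hypothetical helper: reinterpret raw bits as a double. */
static double double_from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* In the style of the existing check: any NaN whose low word is 1954. */
static int is_na_low_word(double x)
{
    uint64_t bits;
    if (!isnan(x))
        return 0;
    memcpy(&bits, &x, sizeof bits);
    return (bits & 0xffffffffu) == 1954u;
}

/* In the style of the proposed check: all 64 bits must match exactly. */
static int is_na_exact(double x)
{
    const double na = double_from_bits(NA_BITS);
    return memcmp(&x, &na, sizeof x) == 0;
}
```

On platforms that pass doubles without altering their bits, both functions accept the canonical pattern, but only the low-word check accepts the quieted one. So the exact comparison is safe only if every NA that reaches it still has the canonical bits.
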
>>> >>
>>> >> (interested parties can see + run a simple benchmark from the gist at
>>> >> https://gist.github.com/kevinushey/8911432)
>>> >>
>>> >> Thanks,
>>> >> Kevin
>>> >>
>>> >> ______________________________________________
>>> >> R-devel at r-project.org mailing list
>>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> >>
>>> >
>>> >
>>> >--
>>> >Brian D. Ripley,                  ripley at stats.ox.ac.uk
>>> >Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>>> >University of Oxford,             Tel:  +44 1865 272861 (self)
>>> >1 South Parks Road,                     +44 1865 272866 (PA)
>>> >Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>>
>>
>
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
> Biology, UCT), Dipl. Phys. (Germany)
>
> Centre of Excellence for Invasion Biology
> Stellenbosch University
> South Africa
>
> Tel :       +33 - (0)9 53 10 27 44
> Cell:       +33 - (0)6 85 62 59 98
> Fax :       +33 - (0)9 58 10 27 44
>
> Fax (D):    +49 - (0)3 21 21 25 22 44
>
> email:      Rainer at krugs.de
>
> Skype:      RMkrug
>