[Rd] complex NA's match(), etc: not back-compatible change proposal

Sat May 28 11:34:08 CEST 2016

On 'factor', I meant the case where 'levels' is not specified, where 'unique' is called.

> factor(c(complex(real=NaN), complex(imaginary=NaN)))
[1] NaN+0i <NA>
Levels: NaN+0i

Look at <NA> in the result above. Yes, it happens in earlier versions of R, too.
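
To check for this condition in a given R build, here is a minimal sketch. The variable names are mine, and the counts it reports depend on the R version, so no particular result is claimed.

```r
# Two complex values whose NaN sits in different parts
x <- c(complex(real = NaN), complex(imaginary = NaN))

# factor() without 'levels' derives its levels from unique(x),
# then matches as.character(x) against the level labels
n_values <- length(unique(x))
n_labels <- length(unique(as.character(x)))

# TRUE means some label has no corresponding level, so factor(x) contains <NA>
mismatch <- n_labels > n_values

cat("unique values:", n_values, " unique labels:", n_labels,
    " mismatch:", mismatch, "\n")
print(factor(x))
```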

On matching both NA and NaN, another consequence is that length(unique(.)) may depend on order. Example using R devel r70604:

> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
> (z <- z[is.na(z)])
 [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
 [9]   0+NaNi   1+NaNi       NA NaN+NaNi
> length(print(unique(z)))
[1]     NA NaN+0i
[1] 2
> length(print(unique(c(z[8], z[-8]))))
[1] NA
[1] 1
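
The order dependence follows from the matching rule not being transitive: NA matches the mixed value and the mixed value matches NaN, but NA does not match NaN. Below is a rough R-level emulation, not the C code in unique.c. The helpers has_NA, has_NaN, ceq and greedy_unique are hypothetical names, and values are held as Re/Im pairs so that the result does not depend on the R version.

```r
# Does a (re, im) pair contain a non-NaN NA, or a NaN?
has_NA  <- function(p) any(is.na(c(p$re, p$im)) & !is.nan(c(p$re, p$im)))
has_NaN <- function(p) any(is.nan(c(p$re, p$im)))

# Matching rule as described in this thread, for values that contain NA or NaN:
# both contain an NA, or both contain a NaN
ceq <- function(a, b) (has_NA(a) && has_NA(b)) || (has_NaN(a) && has_NaN(b))

# Keep an element only if it matches nothing kept so far
greedy_unique <- function(v) {
  kept <- list()
  for (e in v) {
    if (!any(vapply(kept, ceq, logical(1), b = e)))
      kept[[length(kept) + 1L]] <- e
  }
  kept
}

only_NA  <- list(re = NA_real_, im = 0)         # like z[1]
only_NaN <- list(re = NaN,      im = 0)         # like z[2]
both     <- list(re = NaN,      im = NA_real_)  # like z[8]

n_mixed_last  <- length(greedy_unique(list(only_NA, only_NaN, both)))
n_mixed_first <- length(greedy_unique(list(both, only_NA, only_NaN)))
cat(n_mixed_last, n_mixed_first, "\n")
```

With the mixed value last, two representatives survive; with it first, it absorbs both of the others.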
--------------------------------------------
On Mon, 23/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

 Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal

 Cc: R-devel at r-project.org
 Date: Monday, 23 May, 2016, 11:06 PM

 >>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
 >>>>>     on Fri, 13 May 2016 16:33:05 +0000 writes:

     > That, for example, complex(real=NaN) and complex(imaginary=NaN) are
     > regarded as equal makes it possible that

     >     length(unique(as.character(x))) > length(unique(x))

     > (current code of function 'factor' doesn't expect it).

 Thank you, that is an interesting remark - but is already true,
 in
[[elided Yahoo spam]]

 ..
 and of course this is because we do *print*   0+NaNi  etc,
 i.e., we differentiate the  non-NA-but-NaN complex values in
 formatting / printing but not in match(), unique() ...

 and indeed, with the  'z'  example below,

 fz <- factor(z,z)

 gives a warning about duplicated levels, and gives such warnings
 also in current (and previous) versions of R, at least for the slightly
 larger z I've used in the tests/reg-tests-1c.R example.

 For the moment I can live with that warning, as I don't think
 factor()s are constructed from complex numbers "often"...
 and the performance of factor() in the more regular cases is important.

     > Yes, an argument for the behavior is that NA and NaN are of one kind.

     > On my system, using 32-bit R for Windows from binary from CRAN,
     > the result of sapply(z, match, table = z) (not in current R-devel)
     > may be different from below:

     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.10.1, different from below
     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 3.2.5, different from below

 interesting, thank you... and another reason why the change
 (currently only in R-devel) may have been a good one: More uniformity.

     > I noticed that, by function 'cequal' in unique.c, a complex number
     > that has both NA and NaN matches NA and also matches NaN.

     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
     >> (z <- z[is.na(z)])
     >  [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
     >  [9]   0+NaNi   1+NaNi       NA NaN+NaNi

     >> sapply(z, match, table = z[8])
     > [1] 1 1 1 1 1 1 1 1 1 1 1 1
     >> match(z, z[8])
     > [1] 1 1 1 1 1 1 1 1 1 1 1 1

 Yes, I see the same. But isn't it what we expect:

 All of our z[] entries have at least one NA or NaN in their real
 or imaginary part, and since z[8] has both, it does match all
 z[]'s, either because of the NA or because of the NaN in common.

 Hence, currently, I don't think this needs to be changed...
 but if there are other reasons / arguments ...

 Thank you again,
 Martin Maechler

     >> sessionInfo()
     > R Under development (unstable) (2016-05-12 r70604)
     > Platform: i386-w64-mingw32/i386 (32-bit)
     > Running under: Windows XP (build 2600) Service Pack 2

     > locale:
     > [1] LC_COLLATE=English_United States.1252
     > [2] LC_CTYPE=English_United States.1252
     > [3] LC_MONETARY=English_United States.1252
     > [4] LC_NUMERIC=C
     > [5] LC_TIME=English_United States.1252

     > attached base packages:
     > [1] stats     graphics  grDevices utils     datasets  methods   base

     > -----------------
 >>>>> Martin Maechler <maechler at stat.math.ethz.ch>
 >>>>>     on Tue, 10 May 2016 16:08:39 +0200 writes:

     >> This is an RFC / announcement related to the 2nd part of PR#16885
     >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
     >> about  complex NA's.

     >> The (somewhat rare) incompatibility in R 3.3.0's match() behavior for the
     >> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
     >> patched in the meantime} triggered some more comprehensive "research".

     >> I found that we have had a long-standing inconsistency at least between the
     >> documented and the real behavior.  I am claiming that the documented
     >> behavior is desirable and hence R's current "real" behavior is buggy, and
     >> I am proposing to change it, in R-devel (to be 3.4.0) for now.

     > After the  "roaring unanimous" assent  (one private msg
     > encouraging me to go forward, no dissenting voice, hence an
     > "odds ratio" of  +Inf  in favor ;-)
     > I have now committed my proposal to R-devel (svn rev. 70597) and
     > some of us will be seeing the effect in package space within a
     > day or so, in the CRAN checks against R-devel (not for
     > bioconductor AFAIK; their checks use R-devel only when it is less
     > than ca 6 months from release).

     > It's still worthwhile to discuss the issue, if you come late
     > to it, notably as ---paraphrasing Dirk on the R-package-devel list---
     > the release of 3.4.0 is almost a year away, and so now is the
     > best time to tinker with the API, in other words, consider breaking
     > rarely used legacy APIs..

     > Martin

     >> In help(match) we have been saying

     >> |  Exactly what matches what is to some extent a matter of definition.
     >> |  For all types, \code{NA} matches \code{NA} and no other value.
     >> |  For real and complex values, \code{NaN} values are regarded
     >> |  as matching any other \code{NaN} value, but not matching \code{NA}.

     >> for at least 10 years.  But we don't do that at all in the
     >> complex case (and AFAIK never got a bug report about it).
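
For real (double) vectors the documented rule can be checked directly. A short sketch follows; the variable names are only for illustration.

```r
x <- c(1, NA, NaN)

# NA matches only NA, and NaN matches only NaN
pos <- match(c(NA, NaN), x)
print(pos)

# so NA and NaN stay distinct under unique()
u <- unique(c(NA, NaN, NA, NaN))
print(length(u))
```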

     >> Also, e.g., print(.) or format(.) do simply use  "NA" for all
     >> the different complex NA-containing numbers, where OTOH,
     >> non-NA NaN's { <=>  !is.nan(z) & is.na(z) }
     >> in format() or print() do show the NaN in real and/or imaginary
     >> parts; for an example, look at the "format" column of the matrix
     >> below, after 'print(cbind' ...

     >> The current match()---and duplicated(), unique() which are based on the same
     >> C code---*do* distinguish almost all complex NA / NaN's which is
     >> NOT according to documentation. I have found that this is just because
     >> our hashing function for the complex case, chash() in R/src/main/unique.c,
     >> is bogus in the sense that it is not compatible with the above documentation
     >> and also not with the cequal() function (in the same file unique.c) for checking
     >> equality of complex numbers.
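
Why an incompatible hash defeats the equality test: in a hash table, two values are only compared when they land in the same bucket. Below is a toy R emulation, not the C code. All helper names are hypothetical, the equality rule assumed is "both contain an NA, or both contain a NaN", and the two hash functions are made up for illustration.

```r
has_NA  <- function(p) any(is.na(c(p$re, p$im)) & !is.nan(c(p$re, p$im)))
has_NaN <- function(p) any(is.nan(c(p$re, p$im)))
eq <- function(a, b) (has_NA(a) && has_NA(b)) || (has_NaN(a) && has_NaN(b))

# "Fine" hash: the key depends on which part is missing, so equal values can be separated
hash_fine   <- function(p) paste(p$re, p$im)
# "Coarse" hash: every value containing NA or NaN gets the same key
hash_coarse <- function(p) if (anyNA(c(p$re, p$im))) "missing" else paste(p$re, p$im)

count_unique <- function(v, hash) {
  buckets <- list()                  # key -> list of values kept under that key
  n <- 0L
  for (e in v) {
    key <- hash(e)
    bucket <- buckets[[key]]         # NULL if the key is new
    if (!any(vapply(bucket, eq, logical(1), b = e))) {
      buckets[[key]] <- c(bucket, list(e))
      n <- n + 1L
    }
  }
  n
}

v <- list(list(re = NA_real_, im = 0),   # NA in the real part
          list(re = 0, im = NA_real_))   # NA in the imaginary part; eq() says equal

n_fine   <- count_unique(v, hash_fine)    # different buckets, never compared
n_coarse <- count_unique(v, hash_coarse)  # same bucket, recognised as equal
cat(n_fine, n_coarse, "\n")
```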

     >> As I have found, a *simplified* version of the chash() function
     >> to make it compatible with cequal() does solve all the problems I've
     >> indicated,  and the current plan is to commit that change --- after some
     >> discussion time, here on R-devel ---  to the code base.

     >> My change passes  'make check-all'  fine, but I'm 100% sure that there will
     >> be effects in package-space. ... one reason for this posting.

     >> As mentioned above, note that the chash() function has been in
     >> use for all three functions
     >> match()
     >> duplicated()
     >> unique()
     >> and the change will affect all three --- but just for the case of complex
     >> vectors with NA or NaN's.

     >> To show more, a small R session -- using my version of R-devel
     >> == the proposition:
     >> The R script ('complex-NA-short.R') for (a bit more than) the
     >> session is attached {{you can attach  text/plain easily}}:

     >>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
     >>> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
     >>> ##                  similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
     >>> (z <- z[is.na(z)])
     >>  [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
     >>  [9]   0+NaNi   1+NaNi       NA NaN+NaNi
     >>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
     >> +     r <- matrix( , length(x), length(y))
     >> +     for(i in seq(along=x))
     >> +         for(j in seq(along=y))
     >> +             r[i,j] <- identical(z[i], z[j], ...)
     >> +     r
     >> + }
     >>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
     >>> ## a version that works in older versions of R, where identical() had fewer arguments!
     >>> outerID.picky <- function(x,y) {
     >> +     nF <- length(formals(identical)) - 2
     >> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
     >> + }
     >>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
     >>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]

     >>  [1,] | . . . . . . . . . . .
     >>  [2,] . | . . . . . . . . . .
     >>  [3,] . . | . . . . . . . . .
     >>  [4,] . . . | . . . . . . . .
     >>  [5,] . . . . | . . . . . . .
     >>  [6,] . . . . . | . . . . . .
     >>  [7,] . . . . . . | . . . . .
     >>  [8,] . . . . . . . | . . . .
     >>  [9,] . . . . . . . . | . . .
     >> [10,] . . . . . . . . . | . .
     >> [11,] . . . . . . . . . . | .
     >> [12,] . . . . . . . . . . . |
     >>> try(# for older R versions
     >> +     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
     >> + )
     >>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
     >> [1] 1 2 1 2 1 1 1 1 2 2 1 2

     >>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
     >>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
     >>       format   Re   Im   mz
     >>  [1,]       NA <NA> 0    1
     >>  [2,] NaN+  0i NaN  0    2
     >>  [3,]       NA <NA> 1    1
     >>  [4,] NaN+  1i NaN  1    2
     >>  [5,]       NA 0    <NA> 1
     >>  [6,]       NA 1    <NA> 1
     >>  [7,]       NA <NA> <NA> 1
     >>  [8,]       NA NaN  <NA> 1
     >>  [9,]   0+NaNi 0    NaN  2
     >> [10,]   1+NaNi 1    NaN  2
     >> [11,]       NA <NA> NaN  1
     >> [12,] NaN+NaNi NaN  NaN  2
     >>>
     >> -------------------------------

     >> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
     >> are very different in current R,
     >> distinguishing most kinds of NA / NaN  against the documentation (and the
     >> real/numeric case).

     >> Martin Maechler
     >> R Core Team

     >> ### Basically a shortened version of  the PR#16885 -- complex part b)
     >> ### of  R/tests/reg-tests-1c.R

     >> ## b) complex 'x' with different kinds of NaN
     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
     >> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
     >> ##                  similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
     >> (z <- z[is.na(z)])
     >> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
     >>     r <- matrix( , length(x), length(y))
     >>     for(i in seq(along=x))
     >>         for(j in seq(along=y))
     >>             r[i,j] <- identical(z[i], z[j], ...)
     >>     r
     >> }

     >> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
     >> ## a version that works in older versions of R, where identical() had fewer arguments!
     >> outerID.picky <- function(x,y) {
     >>     nF <- length(formals(identical)) - 2
     >>     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
     >> }
     >> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
     >> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
     >> try(# for older R versions
     >>     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
     >> )
     >> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
     >> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
     >> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)

     >> ## compute  match(z[i], z) , for  i = 1,2,..,12  :
     >> (m1z <- sapply(z, match, table = z))
     >> ## 1 2 1 2 2 2 1 2 2 2 1 2   # R 1.2.3  (2001-04-26)
     >> ## 1 2 3 4 1 3 7 8 2 4 8 7   # R 1.4.1  (2002-01-30)
     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.5.1  (2002-06-17)
     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.8.1  (2003-11-21)
     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.0.1  (2004-11-15)
     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.1.1  (2005-06-20)
     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.3.1  (2006-06-01)
     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.5.1  (2007-06-27)
     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.10.1 (2009-12-14)
     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.1.1  (2014-07-10)
     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.2.5 -- and 3.3.0 patched
     >> ## 1 2 1 2 1 1 1 1 2 2 1 2   # <<-- Martin's R-devel and proposed future R

     >> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
     >> stopifnot(apply(zRI, 2, anyNA)) # *all* are  NA *or* NaN (or both)
     >> is.NA <- function(.) is.na(.) & !is.nan(.)
     >> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
     >> (iNA <-  apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
     >> ## In Martin's version of R-devel :
     >> stopifnot(identical(m1z == 1, iNA),
     >>           identical(m1z == 2, !iNA))
     >> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
     >> stopifnot(identical(m1z, mz))

     >> ______________________________________________
     >> R-devel at r-project.org mailing list
     >> https://stat.ethz.ch/mailman/listinfo/r-devel
