# [Rd] complex NA's match(), etc: not back-compatible change proposal

Suharto Anggono suharto_anggono at yahoo.com
Fri Jun 3 21:13:11 CEST 2016

```
With 'z' of length 8 below, or of length 12 previously, one may try
sapply(rev(z), match, table = rev(z))
match(rev(z), rev(z))

I found that the two results were different in R devel r70604.

A shorter one:

> z <- complex(real = c(0,NaN,NaN), imaginary = c(NA,NA,0))
> sapply(z, match, table = z)
[1] 1 1 2
> match(z, z)
[1] 1 1 3

An explanation of the behavior: with an ordinary equality relation, if z[2] is equal to z[1] and z[3] is not equal to z[1], then z[3] is not equal to z[2]. That is not the case here with 'cequal'. However, 'match' seems to assume this property in the usual case.

To fix this, it is enough to change 'cequal' so that a complex number that has both NA and NaN matches NA and doesn't match NaN. That also makes length(unique(.)) independent of order.

As for further changes, I am fine with '1 A'.
--------------------------------------------
On Mon, 30/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal

Cc: R-devel at r-project.org
Date: Monday, 30 May, 2016, 5:48 PM

>>>>> Suharto Anggono
>>>>>     on Sat, 28 May 2016 09:34:08 +0000 writes:

> On 'factor', I meant the case where 'levels' is not
> specified, where 'unique' is called.

I see, thank you.

>> factor(c(complex(real=NaN), complex(imaginary=NaN)))
> [1] NaN+0i <NA>
> Levels: NaN+0i

> Look at <NA> in the result above. Yes, it happens in
> earlier versions of R, too.

Yes; let's call this "problem 1"

> On matching both NA and NaN, another consequence is that
> length(unique(.)) may depend on order.
> Example using R devel r70604:

>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>> (z <- z[is.na(z)])
> [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
> [9]   0+NaNi   1+NaNi       NA NaN+NaNi
>> length(print(unique(z)))
> [1]     NA NaN+0i
> [1] 2
>> length(print(unique(c(z[8], z[-8]))))
> [1] NA
> [1] 1
>
--------------------------------------------

Thank you, Suharto. I agree these are even more convincing
reasons to consider changing.
Let's call this ("matching both NA and NaN")  "problem 2".

I think we agree that the R-devel -- compared to previous
versions -- *is* consistent in its (C level) functions cequal()
and  chash() and also is consistent with the documentation
of match()/unique()/duplicated().

Hence I think a change would have to affect all of the above,
including a change of documentation.

Also, resolution of "problem 1" and "problem 2" are related, but
--I think-- almost separate.
For the following, let's use a vector notation for complex
numbers, say
(a, b) :== complex(real = a, imaginary = b)

With R  (showing relevant examples):
##------------------------------------------------------------------------------
options(width = max(85, getOption("width"))) # so 'z' prints in one line
p.z <- function(z) print(noquote(paste0("(",Re(z),",",Im(z),")")))
z <- c(1,NA,NaN); z <- outer(z,z, complex, length.out=1); (z <- z[is.na(z)])
##     NA NaN+  1i       NA       NA       NA   1+NaNi       NA NaN+NaNi
p.z(z)
##  (NA,1)  (NaN,1)  (1,NA)  (NA,NA)  (NaN,NA)  (1,NaN)  (NA,NaN)  (NaN,NaN)
length(p.z(unique(z[ 1:8 ])))
## [1] (NA,1)  (NaN,1)
## [1] 2
length(p.z(unique(z[ c(8,1:7) ])))
## [1] (NaN,NaN) (NA,1)
## [1] 2
length(p.z(unique(z[ c(7:8,1:6) ])))
## [1] (NA,NaN)
## [1] 1
##------------------------------------------------------------------------------

Problem 1:
To me, at the moment, it would seem most "natural" to consider a
change where the match()/unique()/duplicated() behavior  matched
the behavior of print()/format()/as.character() for such
complex vectors.
I think this would automatically solve the issue that sometimes

    length(unique(as.character(x))) > length(unique(x))

There are principally two solutions to this:

A: change  match()/unique()/duplicated()
B: change  print()/format()/as.character()

For A -- which seems "less disruptive" and more desirable to
me -- we would have to change cequal() {and chash()!} and say
that complex numbers with NA|NaN  "match" if they have any NA, but
otherwise, both the regular (r,i) and the NaN must be at the
exact same places (and *different* NaNs should match, of course).

Problem 2:   unique(z[i])  depends on the permutation 'i'

What should a change be here ...  notably after the "proposed"
(rather only "considered") change   '1 A' above ?

Can "the" new behavior easily be described in words (if '1 A'
above is already assumed)?

At the moment, I would not tackle Problem 2.
It would become less problematic once  Problem 1 is solved
according to '1 A', because at least  length(unique(.)) would
not change:  It would contain *one* z[] with an NA, and all the
other z[]s.

Opinions ?  Thank you in advance for chiming in..

Martin Maechler,
ETH Zurich

> On Mon, 23/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

> Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal

> Cc: R-devel at r-project.org
> Date: Monday, 23 May, 2016, 11:06 PM

>>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>>      on Fri, 13 May 2016 16:33:05 +0000 writes:

>     > That, for example, complex(real=NaN) and complex(imaginary=NaN) are
>     > regarded as equal makes it possible that

>     >     length(unique(as.character(x))) > length(unique(x))

>     > (current code of function 'factor' doesn't expect it).

> Thank you, that is an interesting remark - but is already true,
> in
> [[elided Yahoo spam]]

> ..
> and of course this is because we do *print*   0+NaNi  etc,
> i.e., we differentiate the  non-NA-but-NaN complex values in
> formatting / printing but not in match(), unique() ...

> and indeed, with the  'z'  example below,
>     fz <- factor(z,z)
> gives a warning about duplicated levels and gives such warnings
> also in current (and previous) versions of R,
> at least for the slightly larger z
> I've used in the tests/reg-tests-1c.R example.

> For the moment I can live with that warning, as I don't think
> factor()s are constructed from complex numbers "often"...
> and the performance of factor() in the more
> regular cases is important.
>     > Yes, an argument for the behavior is that NA and NaN are of one kind.
>     > On my system, using 32-bit R for Windows from binary from CRAN,
>     > the result of sapply(z, match, table = z) (not in current
>     > R-devel) may be different from below:

>     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.10.1, different from below
>     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 3.2.5, different from below

> interesting, thank you... and another reason why the change
> (currently only in R-devel) may have been a good one: More uniformity.

>     > I noticed that, by function 'cequal' in unique.c, a complex number that
>     > has both NA and NaN matches NA and also matches NaN.

>     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>     >> (z <- z[is.na(z)])
>     > [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
>     > [9]   0+NaNi   1+NaNi       NA NaN+NaNi

>     >> sapply(z, match, table = z[8])
>     > [1] 1 1 1 1 1 1 1 1 1 1 1 1
>     >> match(z, z[8])
>     > [1] 1 1 1 1 1 1 1 1 1 1 1 1

> Yes, I see the same. But isn't it what we expect:

> All of our z[] entries have at least one NA or a NaN in its real
> or imaginary, and since z[8] has both, it does match with all z[]'s
> either because of the NA or because of the NaN in common.

> Hence, currently, I don't think this needs to be changed...
> but if there are other reasons / arguments ...

> Thank you again,
> Martin Maechler

>     >> sessionInfo()
>     > R Under development (unstable) (2016-05-12 r70604)
>     > Platform: i386-w64-mingw32/i386 (32-bit)
>     > Running under: Windows XP (build 2600) Service Pack 2

>     > locale:
>     > [1] LC_COLLATE=English_United States.1252
>     > [2] LC_CTYPE=English_United States.1252
>     > [3] LC_MONETARY=English_United States.1252
>     > [4] LC_NUMERIC=C
>     > [5] LC_TIME=English_United States.1252

>     > attached base packages:
>     > [1] stats     graphics  grDevices utils     datasets  methods   base

>     > -----------------
>>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>>      on Tue, 10 May 2016 16:08:39 +0200 writes:

>     >> This is an RFC / announcement related to the 2nd part of PR#16885
>     >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
>     >> about  complex NA's.

>     >> The (somewhat rare) incompatibility in R's 3.3.0 match() behavior for the
>     >> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
>     >> patched in the mean time} triggered some more comprehensive "research".

>     >> I found that we have had a long-standing inconsistency at least between the
>     >> documented and the real behavior.  I am claiming that the documented
>     >> behavior is desirable and hence R's current "real" behavior is buggy, and
>     >> I am proposing to change it, in R-devel (to be 3.4.0) for now.

>     > After the  "roaring unanimous" assent  (one private msg
>     > encouraging me to go forward, no dissenting voice, hence an
>     > "odds ratio" of  +Inf  in favor ;-)

>     > I have now committed my proposal to R-devel (svn rev. 70597) and
>     > some of us will be seeing the effect in package space within a
>     > day or so, in the CRAN checks against R-devel (not for
>     > bioconductor AFAIK; their checks using R-devel only when it is less
>     > than ca 6 months from release).

>     > It's still worthwhile to discuss the issue, if you come late
>     > to it, notably as ---paraphrasing Dirk on the R-package-devel list---
>     > the release of 3.4.0 is almost a year away, and so now is the
>     > best time to tinker with the API, in other words, consider breaking
>     > rarely used legacy APIs..

>     > Martin
>     >> In help(match) we have been saying

>     >> |  Exactly what matches what is to some extent a matter of definition.
>     >> |  For all types, \code{NA} matches \code{NA} and no other value.
>     >> |  For real and complex values, \code{NaN} values are regarded
>     >> |  as matching any other \code{NaN} value, but not matching \code{NA}.

>     >> for at least 10 years.  But we don't do that at all in the
>     >> complex case (and AFAIK never got a bug report about it).

>     >> Also, e.g., print(.) or format(.) do simply use "NA" for all
>     >> the different complex NA-containing numbers, where OTOH,
>     >> non-NA NaN's { <=>  !is.nan(z) & is.na(z) }
>     >> in format() or print() do show the NaN in real and/or imaginary
>     >> parts; for an example, look at the "format" column of the matrix
>     >> below, after 'print(cbind' ...
>     >> The current match()---and duplicated(), unique() which are based on the same
>     >> C code---*do* distinguish almost all complex NA / NaN's which is
>     >> NOT according to documentation. I have found that this is just because
>     >> our hashing function for the complex case, chash() in R/src/main/unique.c,
>     >> is bogus in the sense that it is not compatible with the above documentation
>     >> and also not with the cequal() function (in the same file unique.c) for checking
>     >> equality of complex numbers.

>     >> As I have found, a *simplified* version of the chash() function
>     >> to make it compatible with cequal() does solve all the problems I've
>     >> indicated,  and the current plan is to commit that change --- after some
>     >> discussion time, here on R-devel ---  to the code base.

>     >> My change passes  'make check-all' fine, but I'm 100% sure that there will
>     >> be effects in package-space. ... one reason for this posting.

>     >> As mentioned above, note that the chash() function has been in
>     >> use for all three functions
>     >> match()
>     >> duplicated()
>     >> unique()
>     >> and the change will affect all three --- but just for the case of complex
>     >> vectors with NA or NaN's.

>     >> small R session -- using my version of R-devel
>     >> == the proposition:
>     >> The R script ('complex-NA-short.R') for (a bit more than) the
>     >> session is attached {{you can attach  text/plain easily}}:

>     >>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>     >>> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
>     >>> ##               similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
>     >>> (z <- z[is.na(z)])
>     >> [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
>     >> [9]   0+NaNi   1+NaNi       NA NaN+NaNi
>     >>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
>     >> +     r <- matrix( , length(x), length(y))
>     >> +     for(i in seq(along=x))
>     >> +         for(j in seq(along=y))
>     >> +             r[i,j] <- identical(z[i], z[j], ...)
>     >> +     r
>     >> + }
>     >>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
>     >>> ## a version that works in older versions of R, where [[elided Yahoo spam]]
>     >>> outerID.picky <- function(x,y) {
>     >> +     nF <- length(formals(identical)) - 2
>     >> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
>     >> + }
>     >>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
>     >>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix R]
>
>     >>  [1,] | . . . . . . . . . . .
>     >>  [2,] . | . . . . . . . . . .
>     >>  [3,] . . | . . . . . . . . .
>     >>  [4,] . . . | . . . . . . . .
>     >>  [5,] . . . . | . . . . . . .
>     >>  [6,] . . . . . | . . . . . .
>     >>  [7,] . . . . . . | . . . . .
>     >>  [8,] . . . . . . . | . . . .
>     >>  [9,] . . . . . . . . | . . .
>     >> [10,] . . . . . . . . . | . .
>     >> [11,] . . . . . . . . . . | .
>     >> [12,] . . . . . . . . . . . |
>     >>> try(# for older R versions
>     >> +     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
>     >> + )
>     >>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
>     >> [1] 1 2 1 2 1 1 1 1 2 2 1 2
>     >>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
>     >>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
>     >>       format   Re   Im   mz
>     >>  [1,]       NA <NA> 0    1
>     >>  [2,] NaN+  0i NaN  0    2
>     >>  [3,]       NA <NA> 1    1
>     >>  [4,] NaN+  1i NaN  1    2
>     >>  [5,]       NA 0    <NA> 1
>     >>  [6,]       NA 1    <NA> 1
>     >>  [7,]       NA <NA> <NA> 1
>     >>  [8,]       NA NaN  <NA> 1
>     >>  [9,]   0+NaNi 0    NaN  2
>     >> [10,]   1+NaNi 1    NaN  2
>     >> [11,]       NA <NA> NaN  1
>     >> [12,] NaN+NaNi NaN  NaN  2
>     >>>

>     >> -------------------------------
>     >> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
>     >> are very different in current R,
>     >> distinguishing most kinds of NA / NaN  against the documentation (and the
>     >> real/numeric case).

>     >> Martin Maechler
>     >> R Core Team

>     >> ### Basically a shortened version of  the PR#16885 -- complex part b)
>     >> ### of  R/tests/reg-tests-1c.R

>     >> ## b) complex 'x' with different kinds of NaN
>     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>     >> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
>     >> ##               similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
>     >> (z <- z[is.na(z)])
>     >> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
>     >>     r <- matrix( , length(x), length(y))
>     >>     for(i in seq(along=x))
>     >>         for(j in seq(along=y))
>     >>             r[i,j] <- identical(z[i], z[j], ...)
>     >>     r
>     >> }
>     >> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
>     >> ## a version that works in older versions of R,
>     >> [[elided Yahoo spam]]
>     >> outerID.picky <- function(x,y) {
>     >>     nF <- length(formals(identical)) - 2
>     >>     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
>     >> }
>     >> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
>     >> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix R]
>     >> try(# for older R versions
>     >>     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
>     >> )
>     >> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
>     >> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
>     >> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
>     >> ## compute  match(z[i], z) , for  i = 1,2,..,12  :
>     >> (m1z <- sapply(z, match, table = z))
>     >> ## 1 2 1 2 2 2 1 2 2 2 1 2   # R 1.2.3  (2001-04-26)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 7   # R 1.4.1  (2002-01-30)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.5.1  (2002-06-17)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.8.1  (2003-11-21)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.0.1  (2004-11-15)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.1.1  (2005-06-20)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.3.1  (2006-06-01)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.5.1  (2007-06-27)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.10.1 (2009-12-14)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.1.1  (2014-07-10)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.2.5 -- and 3.3.0 patched
>     >> ## 1 2 1 2 1 1 1 1 2 2 1 2   # <<-- Martin's R-devel and proposed future R

>     >> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
>     >> stopifnot(apply(zRI, 2, anyNA)) # *all* are  NA *or* NaN (or both)
>     >> is.NA <- function(.) is.na(.) & !is.nan(.)
>     >> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
>     >> (iNA <-  apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
>     >> ## In Martin's version of R-devel :
>     >> stopifnot(identical(m1z == 1, iNA),
>     >>           identical(m1z == 2, !iNA))
>     >> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
>     >> stopifnot(identical(m1z, mz))
>     >> ______________________________________________
>     >> R-devel at r-project.org mailing list
>     >> https://stat.ethz.ch/mailman/listinfo/r-devel


```
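
The non-transitivity described at the top of this message can be reproduced without depending on any particular R version. What follows is only a minimal sketch in R, not R's C code. It models the r70604 'cequal' rule as the thread describes it (a value that has both NA and NaN matches NA and also matches NaN) next to the suggested rule (a value with any NA matches NA only). Each complex number is represented as a plain c(re, im) pair, and the names ceq_old, ceq_new, first_match and greedy_unique are hypothetical, made up for this illustration. The scan and the de-duplication are simple linear loops, which is an assumption of the sketch and not how match() or unique() are implemented.

```r
## Model only: each complex value is a numeric pair c(re, im).
is_NA   <- function(x) is.na(x) & !is.nan(x)      # a "true" NA, not NaN
has_NA  <- function(p) any(is_NA(p))
has_NaN <- function(p) any(is.nan(p))
regular <- function(p) !any(is.na(p))             # neither NA nor NaN

## Rule as described for R-devel r70604 (a model, hypothetical name):
## sharing an NA is enough to match, and so is sharing a NaN.
ceq_old <- function(p, q) {
    if (regular(p) && regular(q)) return(all(p == q))
    (has_NA(p) && has_NA(q)) || (has_NaN(p) && has_NaN(q))
}

## Suggested rule (hypothetical name): a value with any NA matches
## NA-containing values only; NaN-only values match each other.
ceq_new <- function(p, q) {
    if (regular(p) && regular(q)) return(all(p == q))
    if (has_NA(p) || has_NA(q)) return(has_NA(p) && has_NA(q))
    has_NaN(p) && has_NaN(q)
}

## Index of the first element of 'table' equal to each element of 'x'.
first_match <- function(x, table, eq)
    vapply(x, function(p) {
        hit <- which(vapply(table, function(q) eq(p, q), logical(1)))
        if (length(hit)) hit[1L] else NA_integer_
    }, integer(1))

## Keep an element only if it equals none of those already kept.
greedy_unique <- function(x, eq) {
    keep <- list()
    for (p in x)
        if (!any(vapply(keep, function(q) eq(p, q), logical(1))))
            keep <- c(keep, list(p))
    keep
}

## The short example from this message: (0,NA), (NaN,NA), (NaN,0)
z <- list(c(0, NA), c(NaN, NA), c(NaN, 0))

print(first_match(z, z, ceq_old))                      # 1 1 2
print(first_match(z, z, ceq_new))                      # 1 1 3
print(length(greedy_unique(z, ceq_old)))               # 2
print(length(greedy_unique(z[c(2, 1, 3)], ceq_old)))   # 1
print(length(greedy_unique(z[c(2, 1, 3)], ceq_new)))   # 2
```

Under the modelled old rule the first-match scan gives 1 1 2 for the short 'z', as in the sapply() output above, and the number of unique values depends on the order of the elements. Under the suggested rule the relation is transitive again, so both effects disappear in this model.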