# dist() {"mva" package} bug: treats +/- Inf as NA

Martin Maechler Martin Maechler <maechler@stat.math.ethz.ch>
Tue, 22 Oct 2002 14:29:54 +0200

```>>>>> "BDR" == Brian D Ripley <ripley@stats.ox.ac.uk>
>>>>>     on Mon, 21 Oct 2002 17:44:53 +0100 (BST) writes:

BDR> I think this is definitely better left as is (and
BDR> definitely so for 1.6.1), perhaps documenting the fact.

I don't agree. Whenever there's only one +/- Inf in a pair, the
result is well-defined for most metrics, (often = +Inf, but for Canberra).
The current behavior of silently treating Infs the same as NA is
not acceptable IMO.

Vincent's example was basically

dist(1:4, c(1e10, 2:4))
vs
dist(1:4, c(Inf, 2:4))

where the 2nd should be the "limit" of the first, or at the
least should not silently give the equivalent of
dist(1:4, c(NA, 2:4))

BDR> 1.6.1), perhaps documenting the fact.  What is done
BDR> seems quite sensible to me: you really can't have
BDR> infinite values in an L_2 or L_1 space, and so
BDR> Euclidean and L_1 (= Manhattan) distances are not
BDR> defined.

BDR> One could argue for just one infinity point,
you mean, when for a scalar pair, one is +/- Inf, the other not?
Yes, this is exactly my point.

BDR> with zero distance between infinity points and infinite to all
BDR> finite points, for example (and NA to missing values).

BDR> It really isn't just one answer.

I think it's not hard to do the right thing in the cases where
that's well defined (e.g. when there's at most one Inf per pair),
and
use the current behavior {dropping a pair} in the other
cases, namely in exactly those where IEEE-arithmetic (excuse the
misnomer) would return a NaN.

Martin

BDR> On Mon, 21 Oct 2002, Martin Maechler wrote:

>> Vince Carey found this (thank you!).
>> Since the fix to the problem is not entirely obvious, I post
>> this to R-devel as RFC:
>>
>> help(dist)  says:
>>
>> >>  Missing values are allowed, and are excluded from all computations
>> >>  involving the rows within which they occur.  If some columns are
>> >>  excluded in calculating a Euclidean, Manhattan or Canberra
>> >>  distance, the sum is scaled up proportionally to the number of
>> >>  columns used. If all pairs are excluded when calculating a
>> >>  particular distance, the value is `NA'.
>>
>> but the C code in  ....../src/library/mva/src/distance.c,
>> has, e.g. for the Euclidean distance :
>>
>> count= 0;
>> dist = 0;
>> for(j = 0 ; j < nc ; j++) {
>> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
>> dev = (x[i1] - x[i2]);
>> dist += dev * dev;
>> count++;
>> }
>> i1 += nr;
>> i2 += nr;
>> }
>> if(count == 0) return NA_REAL;
>> if(count != nc) dist /= ((double)count/nc);
>> return sqrt(dist);
>>
>> where it is clear that "R_FINITE(*)" should in principle be
>> replaced by  "!ISNAN(*)".
>>
>> Note however that "Inf - Inf -> NaN" and e.g the canberra metric
>> has more ways to get "NaN".
>>
>> The current code drops all pairs with an +-Inf in it.
>> I would be inclined to really replace
>> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
>> by
>> if(!ISNAN(x[i1])   && !ISNAN(x[i2])) {
>>
>> for all metrics -- for R-patched.
>>
>> But I'd also see reasons where we'd want to be smarter/different
>> than that. One possibility would be to drop the pair (as
>> currently) also when both are not finite,
>> for "binary" to signal an error for +- Inf,
>> and for "Canberra" :
>> Of course,  d =  |x - y| / |x + y|
>> will be 1 , when one of {x,y} is infinite.
>> This could be considered the desired answser, or also
>> we may give a warning in any case.
>>
>> Opinions?
>>
>>
>> Martin Maechler <maechler@stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
>> Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
>> ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
>> phone: x-41-1-632-3408		fax: ...-1228			<><

BDR> --
BDR> Brian D. Ripley,                  ripley@stats.ox.ac.uk
BDR> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
BDR> University of Oxford,             Tel:  +44 1865 272861 (self)
BDR> 1 South Parks Road,                     +44 1865 272860 (secr)
BDR> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

```