[Rd] Re: [R] Canberra dist and double zeros

Prof Brian Ripley ripley@stats.ox.ac.uk
Tue, 6 Mar 2001 09:40:30 +0000 (GMT)


On Tue, 6 Mar 2001, Jari Oksanen wrote:

> ripley@stats.ox.ac.uk said:
> > [Moved to R-devel, as more appropriate.]
>
> This means that I probably have to subscribe (momentarily) to R-devel, which I
> have regarded as too technical for a non-developer like me.

We'll keep you on the Cc: list.  Normally things like this are on R-devel,
as they are specialized.

> ripley@stats.ox.ac.uk said:
> > I am sure we should do something, but is this exactly right?
>
> I am not sure either: it is right for me in my present applications, but I
> think it may not be right in general.  I used dist() for community data, where
> zero *is* zero (not merely an approximately zero floating-point number) and means
> that the species is absent, and of course, all numbers are positive or zero.
> Canberra distance is OK for negative numbers as well, and so x_i = -1, y_i = 1
> would yield 2/0 which probably shouldn't be regarded as zero, but rather as
> NaN.  So a better test would be for above-zero numerator or explicitly for
> both x_i && y_i.

I think it should be Inf, and I was going to comment that this was another
problem.

> ripley@stats.ox.ac.uk said:
> >  The issue is if count should be incremented if sum == 0.0 or not.
>
> I don't know, and I don't have Lance & Williams 1967 to check. However, more
> recent papers by Canberra people do *not* increment count for double-zeros
> (Faith, Minchin, Belbin 1987. Compositional dissimilarity as a robust measure
> of ecological distance. Vegetatio 69, 57-68.).  I have no idea about the
> really *correct* solution or what are the arguments for incrementing or not
> incrementing count. At least not incrementing means that count varies with
> pairs of observations instead of being a simple down-scaling by a constant for
> the entire matrix.  However, probably the original Lance & Williams choice was
> to increment only for sum > 0.

Note count is only relevant if count < nc, and the code in 1.2.2 is wrong:
it should have been

    if(count != nc) dist /= ((double)count/nc);

Fortunately, it was never used.

> Some other people may have better libraries to
> check both the choice and the argument (I may have a look there, but I would
> be surprised if I find Aust. Comput. J. 1, 15-20 here).  Deciding whether to
> increment count would need a test for an above-zero denominator, which begins
> to look like ugly coding if we need a test for the numerator as well.

You need that test anyway to distinguish 2/0 from 0/0.  We can code any solution,
and this is simple and clean compared to, say, scan.c!

I am going to implement that x1=x2=0 is equivalent to missing, and that
x1=+1, x2=-1 gives 2/0 = Inf.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
