[Rd] cmdscale & non-Euclidean dissimilarities

Wed Jun 10 14:27:27 CEST 2009

Dear R gurus,

I think that cmdscale() function has problems with non-Euclidean
distances which have negative eigenvalues. The problems are two-fold: 

(1) Wrong eigenvalue is removed: there will be at least one zero
eigenvalue in cmdscale, and the function assumes it is the last one.
With non-Euclidean dissimilarities you will have negative eigenvalues,
and the zero eigenvalue will be the last positive one before negative
eigenvalues. Now the function returns the zero eigenvalue and
corresponding zero-eigenvector, but drops the last negative eigenvalue
(which has larger absolute value than any other negative eigenvalue). 

(2) Gower (1985) says that with non-Euclidean matrices and negative
eigenvalues you will have imaginary axes, and the distances on imaginary
axes (negative eigenvalues) should be subtracted from the distances on
real axes (positive eigenvalues). The formulation in the article is like
this (Gower 1985, p. 93):

"""
f_{ii} + f_{jj} - 2 f_{ij} = d_{ij}^2 =
\sum_{p=1}^r (l_{pi} - l_{pj})^2 - \sum_{p=r+1}^{r+s} (l_{pi} - l_{pj})^ 2

This is the usual "Pythagorean" representation of squared distances in
terms of coordinates $l_{pi} (p = 1, 2 \ldots r+s)$, except that for
$p>r$ the coordinates become purely imaginary.
"""

This also suggests that for GOF (goodness of fit) measure of cmdscale()
the negative eigenvalues should be subtracted from the sum of positive
eigenvalues. Currently, the function uses two ways: the sum of abs
values of eigenvalues (and it should be sum of eigenvalues with their
signs), and the sum of above-zero eigenvalues for the total. The latter
makes some sense, but the first looks non-Gowerian. 

Reference

Gower, J. C. (1985) Properties of Euclidean and non-Euclidean distance
matrices. Linear Algebra and its Applications 67, 81--97. 

The following change seems to avoid both problems. The change removes
only the eigenvalue that is closest to the zero. There may be more than
one zero eigenvalue (or of magnitude 1e-17), but this leaves the rest
there. It also changes the way the first alternative of GOF is
calculated. This changes the code as little as possible, and it still
leaves behind some cruft of the old code that assumed that last
eigenvalue is the zero eigenvalue.

--- R/src/library/stats/R/cmdscale.R	(revision 48741)
+++ R/src/library/stats/R/cmdscale.R	(working copy)
@@ -56,6 +56,9 @@
         x[non.diag] <- (d[non.diag] + add.c)^2
     }
     e <- eigen(-x/2, symmetric = TRUE)
+    zeroeig <- which.min(abs(e$values))
+    e$values <- e$values[-zeroeig]
+    e$vectors <- e$vectors[ , -zeroeig, drop = FALSE]
     ev <- e$values[1L:k]
     if(any(ev < 0))
         warning(gettextf("some of the first %d eigenvalues are < 0", k),
@@ -63,9 +66,9 @@
     points <- e$vectors[, 1L:k, drop = FALSE] %*% diag(sqrt(ev), k)
     dimnames(points) <- list(rn, NULL)
     if (eig || x.ret || add) {
-        evalus <- e$values[-n]
+        evalus <- e$values
         list(points = points, eig = if(eig) ev, x = if(x.ret) x,
              ac = if(add) add.c else 0,
-             GOF = sum(ev)/c(sum(abs(evalus)), sum(evalus[evalus > 0])))
+             GOF = sum(ev)/c(sum(evalus), sum(evalus[evalus > 0])))
     } else points
 }

Best wishes, Jari Oksanen
-- 
Jari Oksanen -- Dept Biology, Univ Oulu, FI-90014 Oulu, Finland
email jari.oksanen at oulu.fi, homepage http://cc.oulu.fi/~jarioksa/