[R] distance in the function kmeans

Jari Oksanen jari.oksanen at oulu.fi
Sat May 29 07:53:11 CEST 2004


My thread broke as I write this at home and there were no new messages 
on this subject after I got  home. I hope this still reaches interested 
parties.

There are several methods that find centroids (means) from distance 
data. Centroid clustering methods do so, and so does classic scaling 
a.k.a. metric multidimensional scaling a.k.a. principal co-ordinates 
analysis (in R function cmdscale the means are found in C function 
dblcen.c in R sources). Strictly this centroid finding only works with 
Euclidean distances, but these methods willingly handle any other 
dissimilarities (or distances). Sometimes this results in anomalies 
like upper levels being below lower levels in cluster diagrams or in 
negative eigenvalues in cmdscale. In principle, kmeans could do the 
same if she only wanted.

Is it correct to use non-Euclidean dissimilarities when Euclidean 
distances were assumed? In my field (ecology) we know that Euclidean 
distances are often poor, and some other dissimilarities have better 
properties, and I think it is OK to break the rules (or `violate the 
assumptions'). Now we don't know what kind of dissimilarities were used 
in the original post (I think I never saw this specified), so we don't 
know if they can be euclidized directly using ideas of Petzold or 
Simpson. They might be semimetric or other sinful dissimilarities, too. 
These would be bad in the sense Uwe Ligges wrote: you wouldn't get 
centres of Voronoi polygons in original space, not even non-overlapping 
polygons. Still they might work better than the original space (who 
wants to be in the original space when there are better spaces floating 
around?)

The following trick handles the problem euclidizing space implied by 
any dissimilarity meaasure (metric or semimetric). Here mdata is your 
original (rectangular) data matrix, and dis is any dissimilarity data:

tmp <- cmdscale(dis, k=min(dim(mdata))-1, eig=TRUE)
eucspace <- tmp$points[, tmp$eig > 0.01]

The condition removes axes with negative or almost-zero eigenvalues 
that you will get with semimetric dissimilarities.

Then just call kmeans with eucspace as argument. If your dis is 
Euclidean, this is only a rotation and kmeans of eucspace and mdata 
should be equal. For other types of dis (even for semimetric 
dissimilarity) this maps your dissimilarities onto Euclidean space 
which in effect is the same as performing kmeans with your original 
dissimilarity.

Cheers, jari oksanen
--
Jari Oksanen, Oulu, Finland




More information about the R-help mailing list