[R] -means, hybrid clustering or similar implementations on R

Christian Hennig hennig at stat.math.ethz.ch
Wed May 7 10:29:32 CEST 2003


Hi,

On Wed, 7 May 2003, Skanda Kallur; MEngg wrote:

> Hi,
> 
> I would like to know if someone knows an extended implementation of k-means in R to find appropriate number of clusters for a given k-dimensional data. 

You may use pam in library(cluster). The optimal number of clusters can be
estimated by maximizing
pam(x, k)$silinfo$avg.width
over k (number of clusters). Note that this does not work with k=1,
because the silhouette width is not defined for a single cluster.
pam does not do exactly the same as k-means. By default, it uses Euclidean
distances, not their squares ("k-median"), and all cluster centers are
observed data points (medoids). If you want to "emulate" k-means, you can
provide x as a distance matrix with squared Euclidean distances (which is
often worse than the default, e.g. in the case of outliers).
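
A minimal sketch of this selection, assuming simulated toy data (the
matrix x, the candidate range 2:8 and all variable names are only
illustrative and not from the original question):

```r
library(cluster)  # provides pam()

set.seed(1)
# Hypothetical toy data: three 2-d Gaussian groups of 30 points each
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))

ks <- 2:8  # candidate numbers of clusters; k = 1 is excluded

# Average silhouette width of the pam solution for each k
avg_width <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)
best_k <- ks[which.max(avg_width)]

# k-means-like variant: hand pam squared Euclidean distances
# as a dissimilarity object instead of the raw data
d2 <- dist(x)^2
avg_width_sq <- sapply(ks, function(k) pam(d2, k)$silinfo$avg.width)
best_k_sq <- ks[which.max(avg_width_sq)]
```

Plotting avg_width against ks is usually more informative than looking
at best_k alone, since several values of k may give similar widths.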

An alternative is EMclust in library(mclust), which chooses the
number of clusters by the Bayesian Information Criterion (BIC).
Set the parameter emModelNames="EII" for the mixture-model analogue
of k-means, i.e. spherical components of equal volume (but do this only
if you are sure that you want something k-means-like and not a more
flexible model).
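
A rough sketch of the BIC approach follows. It assumes a current mclust
release, where the corresponding calls are Mclust() with the arguments
G and modelNames, so the names differ from EMclust and emModelNames
above; the toy data and the range 1:8 are again only illustrative.

```r
library(mclust)  # assumed to be installed; provides Mclust()

set.seed(1)
# Hypothetical toy data: three 2-d Gaussian groups of 30 points each
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))

# Fit spherical, equal-volume mixtures ("EII") with 1 to 8 components;
# Mclust keeps the number of components with the best BIC
fit <- Mclust(x, G = 1:8, modelNames = "EII")

n_clusters <- fit$G               # number of components chosen by BIC
bic_values <- fit$BIC             # BIC for every candidate number
labels     <- fit$classification  # cluster label for each row of x
```

Note that mclust reports the BIC with a sign convention in which larger
values are better. Unlike the silhouette approach, one component is a
valid candidate here.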

In general, the number-of-clusters problem is difficult, because it does
not only depend on the data but also on your concept of a "cluster". The
BIC has somewhat better theoretical support than pam's average silhouette
width, but the problem is far from being solved.

Christian

-- 
***********************************************************************
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de



