[R] Optimal initial centroid in kmeans

Thu Jul 3 10:33:12 CEST 2008

On Thu, 2008-07-03 at 11:35 +0800, Chua Siang Li wrote:
> Hello there.  I am using kmeans of base package to cluster my customers.  As
>    the results of kmeans are dependent on the initial centroids, may I know:
>    1) how can we specify the centroids in the R function? (I don't want a random
>    starting point)

You can supply the starting centroids yourself through the 'centers'
argument of kmeans(). Instead of a single number k, pass a matrix with
one row per centroid and one column per variable you are clustering on,
so you can start from any centroids you like.
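
As a minimal sketch of passing explicit starting centroids: the toy data
and the three starting points below are invented for illustration and
stand in for your own customer variables.

```r
# Toy data standing in for the customer table: 60 rows, 2 variables.
set.seed(42)  # only makes the toy data reproducible
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.5), ncol = 2)
)
colnames(x) <- c("v1", "v2")

# One row per starting centroid, one column per clustering variable.
# These coordinates are hypothetical; use whatever makes sense for your data.
start <- rbind(c(0, 0), c(3, 3), c(6, 6))
colnames(start) <- colnames(x)

# No random start is drawn because 'centers' is a matrix, not a number.
fit <- kmeans(x, centers = start)
fit$centers  # final centroids after iterating from 'start'
```

Because nothing random is involved once the centroids are given, rerunning
this on the same data should give the same clustering each time.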

One option here is to do a hierarchical clustering (using say the
average link or Ward's method) of your data, select a number of clusters
and compute the centroids of those clusters, then use those centroids
as the starting points for kmeans(). MASS (the book) by Venables and
Ripley (2002, Modern Applied Statistics with S 4th Ed., Springer) has an
example and R scripts to follow. It is in the multivariate chapter
(sorry I can't be more specific, my copy of the book is at work). The R
scripts come with the MASS package (in the VR bundle) that is part of R.
So have a look for them in your installation. On my linux box they are
in:

R_HOME/library/MASS/scripts/

where R_HOME is the location where R is installed or running from.
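
The hierarchical-clustering start described above could look roughly like
the sketch below. This is not the script from the book, only an
illustration on made-up data; the choice of k = 2 and of average linkage
are assumptions you would revisit for your own data.

```r
# Toy data: two loose groups of 30 observations on 2 variables.
set.seed(1)
x <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2)
)
k <- 2  # assumed number of clusters, e.g. chosen from the dendrogram

# Step 1: hierarchical clustering, cut into k groups.
hc <- hclust(dist(x), method = "average")
groups <- cutree(hc, k = k)

# Step 2: centroid of each group, giving a k x p matrix
# (one row per group, one column per variable).
init <- apply(x, 2, function(col) tapply(col, groups, mean))

# Step 3: use those centroids as the starting points for kmeans().
fit <- kmeans(x, centers = init)

# Compare the hierarchical groups with the final k-means clusters.
table(hierarchical = groups, kmeans = fit$cluster)
```

The cross-table at the end shows how many observations k-means moved
relative to the hierarchical solution it started from.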

>    2) how to determine the optimal (if not, a good) centroid to start with?  (I
>    am not after the fixed seed solution as it only ensures that the cluster is
>    the same at every run but not necessarily a good cluster.)

For anything other than small problems I suspect that you either can't
or you can't do it in a reasonable amount of time. There are a vast
number of possible configurations to evaluate. The recommendation is to
use several random starts and compare them or use the best solution.

kmeans() has argument nstart to specify how many random starts to try.
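
A small sketch of nstart on made-up data follows; the values k = 4 and
nstart = 25 are arbitrary choices for illustration. With nstart > 1,
kmeans() runs from that many random sets of starting centroids and
returns the run with the smallest total within-cluster sum of squares.

```r
# Toy data without much structure, so different starts can disagree.
set.seed(123)
x <- matrix(rnorm(400), ncol = 2)

# A single random start versus the best of 25 random starts.
one  <- kmeans(x, centers = 4, nstart = 1)
many <- kmeans(x, centers = 4, nstart = 25)

# Total within-cluster sum of squares: smaller means tighter clusters.
c(single = one$tot.withinss, best_of_25 = many$tot.withinss)
```

The best of 25 is still only the best of the starts that were tried, not
a guaranteed global optimum.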

cascadeKM() in package vegan allows you to do the many random starts and
it retains the best solution for k = 2, ..., n, where n is specified by
the user. This function has two criteria to evaluate the optimal k (for
the k's tried) so can guide you as to how many clusters to retain and
then use the best of the random starts for that k. But remember, you
haven't tried *all* solutions so these criteria are a guide only.
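
A hedged sketch of cascadeKM() is below. It assumes the vegan package is
installed, uses invented data, and assumes (as I read the help page) that
the second row of the 'results' component holds the criterion value, so
check ?cascadeKM for your version before relying on it.

```r
if (requireNamespace("vegan", quietly = TRUE)) {
  # Toy data: three groups of 30 observations on 2 variables.
  set.seed(7)
  x <- rbind(
    matrix(rnorm(60, mean = 0), ncol = 2),
    matrix(rnorm(60, mean = 5), ncol = 2),
    matrix(rnorm(60, mean = 10), ncol = 2)
  )

  # Try k = 2, ..., 6 with 50 random starts for each k.
  ks <- 2:6
  fit <- vegan::cascadeKM(x, inf.gr = min(ks), sup.gr = max(ks),
                          iter = 50, criterion = "calinski")

  # First row: SSE; second row: criterion value (assumed layout).
  print(fit$results)

  # For the Calinski criterion, larger values indicate a better k.
  best_k <- ks[which.max(fit$results[2, ])]

  # Cluster memberships of the best run for that k
  # (one column of 'partition' per k tried).
  best_partition <- fit$partition[, best_k - min(ks) + 1]
  print(table(best_partition))
}
```

plot(fit) draws the partitions and the criterion side by side, which is
often easier to read than the table of results.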

HTH

G

>    Many Thanks.
>    siangli