[R] PAM: how to get the best number of clusters

Fri Oct 31 00:18:27 CET 2008

On Thursday 30 October 2008, Maura E Monville wrote:
> I have the book you mentioned. It basically describes the silhouette
> method. I do not have it handy because I moved and it is still in a box.
> However, I do not remember the book providing any other criterion for
> finding the best number of clusters.
> On the other hand, I have the same problem with hierarchical clustering
> techniques.
> I use clustering as exploratory analysis because I have no a priori
> knowledge to guide the choice.
> How can multivariate analysis help?
> I launched a loop in which each iteration runs PAM with the number of
> clusters increased by 1 and then computes the silhouette.
> The silhouette value is now oscillating among negative numbers. Can I
> assume that it only gets worse once it has turned negative for the first
> time? If so, I could leave the loop at the first negative value and choose
> the number of clusters with the largest positive silhouette value.
> This procedure would save a lot of CPU time.
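
I am not aware of any guarantee that the average silhouette width only
decreases once it has gone negative, so stopping at the first negative value
is a heuristic rather than a safe shortcut. A minimal sketch of the loop,
assuming a small simulated data set in place of your matrix (the names x, d,
ks, avg.width and best.k are made up for illustration). pam() already returns
the average silhouette width in $silinfo, so no separate silhouette() call is
needed, and scanning a coarse grid of k first is one way to save CPU time:

```r
library(cluster)

# Toy stand-in for the real data: three well-separated 2-D groups.
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))

# pam() expects dissimilarities. Starting from similarities s in [0, 1],
# as.dist(1 - s) is one possible conversion (an assumption, not a rule).
d <- dist(x)

# Candidate numbers of clusters; widen or thin this grid as needed.
ks <- 2:8

# Average silhouette width for each k, taken directly from the pam() fit.
avg.width <- sapply(ks, function(k) pam(d, k)$silinfo$avg.width)

# The k with the largest average silhouette width.
best.k <- ks[which.max(avg.width)]

plot(ks, avg.width, type = "b",
     xlab = "number of clusters k", ylab = "average silhouette width")
```

Plotting the whole curve, rather than stopping early, also shows whether the
maximum is a clear peak or just noise.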

Another approach is stepFlexclust() from the flexclust package, which fits a
range of cluster numbers in one call. See its manual page (?stepFlexclust)
for examples.
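
Roughly, usage looks like the sketch below. It assumes simulated raw data
(the names x, fits and m3 are made up for illustration); note that it clusters
the observations themselves with k-centroids, not a precomputed similarity
matrix, so it only applies if you still have the original data:

```r
library(flexclust)

# Toy stand-in for the real data: three well-separated 2-D groups.
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))

# Fit k = 2..6, keeping the best of nrep random restarts for each k.
fits <- stepFlexclust(x, k = 2:6, nrep = 5, verbose = FALSE)

# Sum of within-cluster distances against k; look for an elbow.
plot(fits)

# Extract the fitted model for one particular k and tabulate cluster sizes.
m3 <- getModel(fits, "3")
table(clusters(m3))
```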

Dylan

> Thank you very much,
> Maura
>
> On Thu, Oct 30, 2008 at 7:25 PM, Dylan Beaudette
> <dylan.beaudette at gmail.com> wrote:
> > On Thursday 30 October 2008, Maura E Monville wrote:
> > > I have a pretty big similarity matrix (2870x2870), and I will produce
> > > even bigger ones soon.
> > > I am using PAM to generate clusters.
> > > The desired number of clusters is a PAM input parameter.
> > > I do not know a priori what the best cluster layout is.
> > > I resorted to the silhouette test. It takes forever because I have to
> > > run PAM with every possible number of clusters.
> > > I wonder whether there is a faster method, either software or some
> > > theoretical guideline, to find the optimal number of clusters.
> > >
> > > Thank you very much,
> >
> > This is a very general topic in the field of multivariate analysis. There
> > really isn't any way to know the 'correct' number of clusters; however,
> > there are several metrics that can give you an indication of how messy
> > your data are.
> >
> > For information on the methods in the cluster package, see this book:
> >
> > Kaufman, L. & Rousseeuw, P. J. Finding Groups in Data: An Introduction to
> > Cluster Analysis. Wiley-Interscience, 2005.
> >
> > Otherwise, consider a book on multivariate analysis. Alternatively, try a
> > hierarchical clustering approach, and look for meaningful groupings.
> > Something like this:
> >
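> > library(cluster)  # provides daisy() and diana()
> >
> > # divisive hierarchical clustering on pairwise dissimilarities
> > d <- diana(daisy(your_data_matrix))
> > d.hc <- as.hclust(d)
> >
> > # label the leaves; assumes your_data_matrix is a data frame with an 'id' column
> > d.hc$labels <- your_data_matrix$id
> >
> > plot(d.hc)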
> >
> > Cheers,
> >
> > Dylan
> >
> >
> > --
> > Dylan Beaudette
> > Soil Resource Laboratory
> > http://casoilresource.lawr.ucdavis.edu/
> > University of California at Davis
> > 530.754.7341
