[R] references on cluster analysis

Martin Maechler maechler at stat.math.ethz.ch
Sat Feb 21 14:35:16 CET 2004


Back from my vacation, I haven't seen an R-help answer on this
  (Christian, where have you been ? ;-)

>>>>> "GiampS" == Giampiero Salvi <giampi at speech.kth.se>
>>>>>     on Sat, 7 Feb 2004 23:40:36 +0100 (CET) writes:

    GiampS> Hi all, I'm doing a study on predicting the "true"
    GiampS> number of clusters in a hierarchical clustering
    GiampS> scheme. My main reference is at the moment

    GiampS> Milligan GW and Cooper MC (1985) "An examination of
    GiampS> procedures for determining the number of clusters in
    GiampS> a data set" Psychometrika vol 50 no 2 pp 159-179

    GiampS> and all the references included in that paper.

(not available to me)

    GiampS> I'm planning to perform a similar comparison on a
    GiampS> number of indexes, but on a much larger data set (in
    GiampS> the order of 3000 points), and with a much higher
    GiampS> "true" number of clusters (in the order of some
    GiampS> hundreds), to see if the properties of the indexes
    GiampS> scale accordingly.

    GiampS> I was wondering if the set of indexes described in
    GiampS> the reference are still "state of the art" (most of
    GiampS> them were introduced in the '60s and '70s), or if
    GiampS> there are new indexes and methods I could include in
    GiampS> my study. I would really appreciate if you could
    GiampS> point me to some newer references addressing this problem.

Gordon's 2nd edition,

  author =	 {A. D. Gordon},
  title = 	 {Classification, 2nd Edition},
  publisher = 	 {Chapman \& Hall/CRC},
  year = 	 1999,
  series =	 {Monographs on Statistics and Applied Probability 82},
  edition =	 {2nd edition}

has a whole chapter (one of the last ones in the book) on this.

R's cluster package has a generic silhouette() function (with 2 methods)
and a plot.silhouette() method; all are improvements over
Kaufman & Rousseeuw's original code.

A recent research paper using "CLEST" (Fridlyand & Dudoit) and
mentioning "GAP" (Tibshirani) etc. still finds silhouette
among the best "indices" for determining the number of clusters.

A student's (master) thesis here seems to point in the same
direction.
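
To make the silhouette criterion concrete, here is a minimal sketch of how
the average silhouette width can be used to choose the number of clusters
from a hierarchical clustering. It is written in plain Python purely for
illustration and is NOT the code of the cluster package; in R one would
typically combine hclust(), cutree() and silhouette() instead. The toy data,
the choice of average linkage, the Euclidean distance and all function names
below are assumptions made for this example only.

```python
from math import dist  # Euclidean distance, Python >= 3.8


def silhouette_widths(points, labels):
    """Return s(i) = (b_i - a_i) / max(a_i, b_i) for every point.

    a_i: mean distance from point i to the other members of its cluster.
    b_i: smallest mean distance from point i to any other cluster.
    Requires at least two clusters; singleton clusters get s(i) = 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singletons
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_linkage_partitions(points, k_values):
    """Naive agglomerative clustering (average linkage, illustrative only).

    Returns {k: labels} for every requested k that is reached while merging.
    """
    wanted = set(k_values)
    min_k = max(1, min(wanted))
    clusters = [[i] for i in range(len(points))]
    result = {}
    while True:
        if len(clusters) in wanted:
            labels = [0] * len(points)
            for lab, members in enumerate(clusters):
                for i in members:
                    labels[i] = lab
            result[len(clusters)] = labels
        if len(clusters) <= min_k:
            break
        # Merge the two clusters with the smallest mean pairwise distance.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = sum(
                    dist(points[i], points[j])
                    for i in clusters[p]
                    for j in clusters[q]
                ) / (len(clusters[p]) * len(clusters[q]))
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return result


def best_k_by_silhouette(points, k_values):
    """Pick the k whose partition has the largest average silhouette width."""
    avg = {}
    for k, labels in average_linkage_partitions(points, k_values).items():
        widths = silhouette_widths(points, labels)
        avg[k] = sum(widths) / len(widths)
    return max(avg, key=avg.get), avg


# Hypothetical toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
best_k, avg_width = best_k_by_silhouette(points, range(2, 7))
print("chosen number of clusters:", best_k)
for k in sorted(avg_width):
    print(k, round(avg_width[k], 3))
```

The naive merging loop above is far too slow for something like 3000 points
and hundreds of clusters; it is only meant to show what is being computed.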

    GiampS> I also read Milligan's chapter in the book
    GiampS> "Clustering and Classification" from 1995, 
(which book? author?)

    GiampS> but didn't find information on this subject that wasn't
    GiampS> included in the previous paper.

Regards,
Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><
