[R] references on cluster analysis

Murray Jorgensen maj at stats.waikato.ac.nz
Mon Feb 23 01:56:31 CET 2004


I don't really believe that there is any satisfactory definition of the
"true number of clusters", let alone a procedure that would reliably find it.

Murray Jorgensen


Martin Maechler wrote:

> Back from my vacation, I haven't seen an R-help answer on this
>   (Christian, where have you been ? ;-)
> 
> 
>>>>>>"GiampS" == Giampiero Salvi <giampi at speech.kth.se>
>>>>>>    on Sat, 7 Feb 2004 23:40:36 +0100 (CET) writes:
> 
> 
>     GiampS> Hi all, I'm doing a study on predicting the "true"
>     GiampS> number of clusters in a hierarchical clustering
>     GiampS> scheme. My main reference is at the moment
> 
>     GiampS> Milligan GW and Cooper MC (1985) "An examination of
>     GiampS> procedures for determining the number of clusters in
>     GiampS> a data set" Psychometrika vol 50 no 2 pp 159-179
> 
>     GiampS> and all the references included in that paper.
> 
> (not available to me)
> 
>     GiampS> I'm planning to perform a similar comparison on a
>     GiampS> number of indexes, but on a much larger data set (in
>     GiampS> the order of 3000 points), and with a much higher
>     GiampS> "true" number of clusters (in the order of some
>     GiampS> hundreds), to see if the properties of the indexes
>     GiampS> scale accordingly.
> 
>     GiampS> I was wondering if the set of indexes described in
>     GiampS> the reference are still "state of the art" (most of
>     GiampS> them were introduced in the '60s and '70s), or if
>     GiampS> there are new indexes and methods I could include in
>     GiampS> my study. I would really appreciate it if you could
>     GiampS> point me to some newer references addressing this problem.
> 
> Gordon's 2nd edition,
> 
>   author =	 {A. D. Gordon},
>   title = 	 {Classification, 2nd Edition},
>   publisher = 	 {Chapman \& Hall/CRC},
>   year = 	 1999,
>   series =	 {Monographs on Statistics and Applied Probability 82},
>   edition =	 {2nd edition}
> 
> has a whole chapter (one of the last ones in the book) on this.
> 
> R's cluster package has a generic silhouette() function (with 2 methods)
> and a plot.silhouette() method; all are improvements on
> Kaufman & Rousseeuw's original code.
> 
> A recent research paper using "CLEST" (Fridlyand & Dudoit),
> mentioning "GAP" (Tibshirani) etc etc, still finds silhouette
> among the best "indices" for determining the number of clusters.
> 
> A student's (master) thesis here seems to point in the same
> direction.
> 
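For readers who have not met the silhouette index mentioned above, here is a
minimal sketch of the usual Kaufman & Rousseeuw definition,
s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from
observation i to the other members of its own cluster and b(i) is the smallest
mean distance from i to the members of any other cluster. It is written in
plain Python only so that it is self-contained; it is not the code in R's
cluster package. The function names, the Euclidean distance, and the toy data
are all assumptions made for illustration.

```python
import math


def silhouette_widths(points, labels):
    """Return the silhouette width s(i) of every observation.

    points: sequence of equal-length numeric tuples (hypothetical toy input)
    labels: cluster label of each point, same length as points
    Distances are Euclidean here; that choice is an assumption of this sketch.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            # Convention: a singleton cluster gets silhouette width 0.
            widths.append(0.0)
            continue
        # a(i): mean distance to the other members of i's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b(i): smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette(points, labels):
    """Mean silhouette width, often compared across candidate partitions."""
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


# Toy data (made up): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the two visible groups
bad_labels = [0, 1, 0, 1, 0, 1]    # deliberately mixes the groups

print("good partition:", average_silhouette(points, good_labels))
print("bad partition: ", average_silhouette(points, bad_labels))
```

To use this as an index for the number of clusters, one would cut the
hierarchical tree at each candidate k, compute the average silhouette width of
the resulting partition, and prefer the k with the largest average. In R the
silhouette() function quoted above serves the same purpose.
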
>     GiampS> I also read Milligan's chapter in the book
>     GiampS> "Clustering and Classification" from 1995, 
> (which book? author?)
> 
>     GiampS> but didn't find information on this subject that wasn't
>     GiampS> included in the previous paper.
> 
> Regards,
> Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
> ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
> phone: x-41-1-632-3408		fax: ...-1228			<><
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> 
> 

-- 
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862



