[R] estimating number of clusters ("Null or more")

Christian Hennig hennig at stat.math.ethz.ch
Thu Apr 24 14:25:20 CEST 2003


Hi,

there are at least two ways to estimate the number of clusters in R.
In library(cluster), you can use the information that comes with the
silhouette plot. How to do this is a bit difficult to figure out from the
help pages (it got better in the recent version, I think); see the help
pages of pam, pam.object and partition.object.
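
A minimal sketch of the silhouette approach, assuming toy data and a
candidate range k = 2,...,6 that I made up for illustration (ks, avg_width
and k_hat are just names for this example). The average silhouette width
is stored in the silinfo component of the pam object:

```r
library(cluster)  # recommended package that ships with R

set.seed(1)
# Toy data (an assumption for illustration): two well-separated groups in 2-d
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))

ks <- 2:6
# Average silhouette width of the pam clustering for each candidate k
avg_width <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)
names(avg_width) <- ks

# Estimated number of clusters: the k with the largest average width
k_hat <- ks[which.max(avg_width)]
print(round(avg_width, 3))
print(k_hat)
```

Note that the silhouette width is only defined for k >= 2, so on its own
this cannot decide between one cluster and several.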

EMclust in library(mclust) chooses an optimal number of mixture
components using the BIC.
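
A minimal sketch of the BIC approach, assuming an mclust version in which
the model-fitting function is called Mclust (an assumption about your
installation; the toy data and the range G = 1:5 are also made up for
illustration). Since G = 1 is among the candidates, the BIC can also
select a single component:

```r
library(mclust)  # assumption: a version of mclust that provides Mclust()

set.seed(1)
# Toy 1-d data (an assumption for illustration): two separated groups
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 6))

# Fit mixtures with 1 to 5 components; the fit with the best BIC is returned
fit <- Mclust(x, G = 1:5)

print(fit$G)    # selected number of mixture components
print(fit$BIC)  # BIC values for every G and covariance model tried
```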

As far as I know, there is no direct answer in R to the problem of testing
homogeneity vs. clustering. There are lots of theoretical difficulties
and there is no "standard routine" for this, neither in R nor
elsewhere. I would suggest inventing a null model that describes your data
as homogeneous, and estimating the distribution of a suitable clustering
statistic (such as the silhouette avg.width in pam, the BIC, the average
distance of the points to their kth nearest neighbor, or the ratio between
the 25% largest and smallest distances in the dataset) by Monte
Carlo/parametric bootstrap. Perhaps I say this too quickly; it's
non-trivial, and at the very least you have to design the simulation so that
rejection/acceptance is not merely a consequence of the data and the null
model being scaled differently.
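
A rough sketch of such a Monte Carlo test for one-dimensional data, under
assumptions of my own choosing: the null model is a single Gaussian with
mean and sd estimated from the data, the statistic is the average
silhouette width of the 2-cluster pam solution, and sil2, null_test and B
are hypothetical names for this example. Simulating with the estimated
mean and sd keeps the null samples on the same scale as the data, and the
silhouette width is a ratio of distances, so it does not depend on scale:

```r
library(cluster)

# Statistic: average silhouette width of the 2-cluster pam solution
sil2 <- function(x) pam(matrix(x, ncol = 1), 2)$silinfo$avg.width

# Parametric bootstrap under a hypothetical null model: one Gaussian
# with mean and sd estimated from the data
null_test <- function(x, B = 199) {
  obs <- sil2(x)
  sim <- replicate(B, sil2(rnorm(length(x), mean(x), sd(x))))
  # Monte Carlo p-value: large silhouette widths speak against homogeneity
  p <- (1 + sum(sim >= obs)) / (B + 1)
  list(statistic = obs, p.value = p)
}

set.seed(1)
x_homog <- rnorm(60)                       # toy homogeneous sample
x_clust <- c(rnorm(30, 0), rnorm(30, 6))   # toy sample with two groups

p_homog <- null_test(x_homog)$p.value
p_clust <- null_test(x_clust)$p.value
print(p_homog)
print(p_clust)
```

Whether a Gaussian is a sensible homogeneous model depends on the data;
for other distribution families the rnorm call would have to be replaced
by a simulation from the appropriate null model.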

Hope that helps,
Christian

On Thu, 24 Apr 2003, Khamenia, Valery wrote:

> Hi all,
> 
>   once more about the old subj :-)
> 
>   My data come from too many different distribution families, and for every
>   particular experiment I just need to decide whether the data is "quite
>   homogeneous" or has two or more clusters. I've revisited the following
>   libraries:
>          amap, clust, cclust, mclust, multiv, normix, survey.
> 
>   And I didn't find any ready-to-use general-purpose criterion for answering
>   the question whether the data is "quite homogeneous" or has two or more
>   clusters. Not even for one-dimensional data.
> 
>   However, in "cclust" a "clustIndex" might be used as a raw criterion.
>   But nothing ready to use, as far as I understand. Or maybe I am wrong?!
> 
>   Q: are there any libraries in R with ready-to-use functions for estimating
>        the number of clusters...
>        - ... with a criterion based on entropy?
>        - ... with a criterion based on ecdf?
> 
> Please Cc to:
> 
>    vkhamenia at biovision.de
> 
> kind thanks.
> ---------------------------------------------------------------------------
> Valery A.Khamenya
> Bioinformatics Department
> BioVisioN AG, Hannover
> 

-- 
***********************************************************************
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de


