Re: [R] estimating number of clusters ("Null or more")

Thu Apr 24 15:11:34 CEST 2003

Dear Christian,

  first of all, thank you for your answer. I am going to read through
  the pages you pointed me to. Meanwhile, I'd like to note that it is
  probably a good idea to put 2-3 lines of R code demonstrating such a
  simple need somewhere in the docs of the `cluster' package. E.g.

  x<-rnorm(500)
  ... # output means we more likely have 1 cluster

  x<-c(rnorm(500), rnorm(500)+5)
  ... # output means we more likely have 2 or more clusters

  It would be nice not only for me.
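
A minimal sketch of what such a snippet could look like, assuming the
average silhouette width of a 2-medoid partition from pam() is used as
the statistic. The helper name avg_sil2 and the seed are hypothetical
choices made for illustration, and the sketch is not taken from the
`cluster' documentation. Note that the silhouette width is only defined
for 2 or more clusters, so the two values can be compared with each
other but there is no fixed cutoff that means "1 cluster".

```r
library(cluster)  # recommended package shipped with R; provides pam()

set.seed(1)  # arbitrary seed, only for reproducibility

# Average silhouette width of a 2-medoid partition (hypothetical helper).
avg_sil2 <- function(x) pam(as.matrix(x), k = 2)$silinfo$avg.width

x1 <- rnorm(500)                     # one homogeneous group
x2 <- c(rnorm(500), rnorm(500) + 5)  # two well-separated groups

w1 <- avg_sil2(x1)
w2 <- avg_sil2(x2)

cat("one group: ", round(w1, 2), "\n")
cat("two groups:", round(w2, 2), "\n")
```

The second value should be clearly larger than the first, but deciding
whether the first one is "small enough" still needs a reference
distribution.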

> EMclust of library mclust decides about an optimal number of mixture
> components using the BIC.

It is not clear to me whether one could use BIC without a
statement about the family of distributions. Indeed, BIC is based
on the likelihood, and what should the likelihood be if the only
adequate statement about the distribution is the ECDF itself?
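
To make the dependence on a parametric family concrete, here is a
minimal sketch that computes BIC by hand, assuming a single normal
component fitted by maximum likelihood. The helper name bic_normal is
hypothetical. It uses the common convention BIC = -2 log L + k log n
(smaller is better); mclust reports the same quantity with the
opposite sign.

```r
# BIC of a single normal component fitted by maximum likelihood.
# The density dnorm() is exactly where the distributional assumption enters.
bic_normal <- function(x) {
  n  <- length(x)
  mu <- mean(x)
  s  <- sqrt(mean((x - mu)^2))  # ML estimate of the sd (divides by n)
  ll <- sum(dnorm(x, mean = mu, sd = s, log = TRUE))
  -2 * ll + 2 * log(n)          # 2 free parameters: mean and sd
}

set.seed(1)  # arbitrary seed, only for reproducibility
x <- rnorm(500)
b <- bic_normal(x)
print(b)
```

Without a density to plug into the log-likelihood line, the function
has nothing to evaluate, which is the difficulty raised above.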

> As far as I know, there is no direct answer to the problem of testing
> homogeneity vs. clustering in R. There are lots of 
> theoretical difficulties and there is no "standard routine" to 
> do this, neither in R, nor elsewhere.

I am not looking for the Holy Grail, or I hope so :-)

In particular, I believe some entropy-based criteria should
fully satisfy me here. BIC might also be good if it can be
applied to an ECDF.

> I would suggest to invent a null model for your data modelled as
> homogeneous and to estimate the distribution of a suitable clustering
> statistics (such as the silhouette avg.width in pam, BIC, average
> distance of the points to kth nearest neighbor or ratio between 25%
> largest and smallest distances in the dataset) by Monte
> Carlo/parametric bootstrap. Perhaps I say this too quickly;

a bit compressed, but something is clear anyway :-)
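
The suggestion quoted above could be sketched roughly as follows. The
sketch assumes a normal distribution as the homogeneous null model and
the average silhouette width of a 2-medoid pam() partition as the
clustering statistic. Both are illustrative choices, and the names
avg_sil2 and null_test, the seed and B = 99 are hypothetical.

```r
library(cluster)  # recommended package shipped with R; provides pam()

# Clustering statistic: average silhouette width of a 2-medoid partition.
avg_sil2 <- function(x) pam(as.matrix(x), k = 2)$silinfo$avg.width

# Parametric bootstrap under an ASSUMED normal null model.
# Returns a Monte Carlo p-value for "statistic at least as large as observed".
null_test <- function(x, B = 99) {
  obs <- avg_sil2(x)
  # Simulate B homogeneous data sets with the same size, mean and sd as x.
  sim <- replicate(B, avg_sil2(rnorm(length(x), mean(x), sd(x))))
  (1 + sum(sim >= obs)) / (B + 1)
}

set.seed(1)  # arbitrary seed, only for reproducibility
p_one <- null_test(rnorm(200))                     # homogeneous data
p_two <- null_test(c(rnorm(100), rnorm(100) + 5))  # two separated groups

cat("p-value, one group: ", p_one, "\n")
cat("p-value, two groups:", p_two, "\n")
```

A small p-value speaks against the homogeneous null model. A different
null model (uniform, say) would give a different reference
distribution, so the choice has to be justified for the data at hand.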

> it's non-trivial and at least you have to design the simulation so
> that rejection/acceptance is not a consequence of different scaling
> of data and null model.

not clear here :-)
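
One possible reading of the scaling remark, offered as an assumption
rather than as its intended meaning: a statistic such as the mean
distance to the kth nearest neighbor changes with the units of the
data, so data and null samples have to be brought to the same scale
before they are compared. A minimal sketch, with a hypothetical helper
mean_knn_dist:

```r
# Mean distance from each point to its k-th nearest neighbor.
mean_knn_dist <- function(x, k = 5) {
  d <- as.matrix(dist(x))
  # Each sorted row starts with the zero self-distance, so take position k + 1.
  mean(apply(d, 1, function(row) sort(row)[k + 1]))
}

set.seed(1)  # arbitrary seed, only for reproducibility
x <- rnorm(200)

raw      <- mean_knn_dist(x)
rescaled <- mean_knn_dist(10 * x)         # same data in other units
std_a    <- mean_knn_dist(scale(x))       # standardized
std_b    <- mean_knn_dist(scale(10 * x))  # standardized after rescaling

cat("raw:", raw, " rescaled:", rescaled, "\n")
cat("standardized:", std_a, std_b, "\n")
```

The raw statistic grows by the same factor as the data, while the
standardized versions agree. If the null samples were generated on a
different scale than the data, a test based on the raw statistic would
reject or accept for that reason alone.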

thanks again
Valery A.Khamenya