[R] cluster analysis and null hypothesis testing

Christian Hennig fm3a004 at math.uni-hamburg.de
Wed Sep 15 18:09:18 CEST 2004


Hi,

Testing the "randomness of a cluster analysis" is not a well-defined
problem, because it depends crucially on your null model. fpc contains
nothing of this kind. The function prabtest in package prabclus performs
such a test, but only for a particular data structure, namely
presence-absence data in biogeography.

In principle, a Monte Carlo test can be constructed (and thus implemented in
R) as follows:

1) You need a null model H_0, from which you generate data.
2) You need a test statistic T.
3) Compute T on your data (call it T_0).
4) Repeat k times:
 a) Generate data from H_0
 b) Compute T on the generated data.
5) The p-value is (K+1)/(k+1), where k is the number of generated datasets
   and K is the number of those for which T<=T_0 (given that a small T
   indicates a tendency towards clustering).
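
The steps above can be sketched in R roughly as follows. This is only a
minimal illustration, not code from any package: the function name mc_test,
the toy data, the uniform null model on the bounding box of the data, and the
mean nearest-neighbour distance used as T are all assumptions made for the
example.

```r
# Generic Monte Carlo test following steps 1)-5).
# x:     observed data (numeric matrix, one row per observation)
# rnull: function() returning one dataset simulated under H_0
# stat:  function(data) returning the test statistic T
# k:     number of simulated datasets
mc_test <- function(x, rnull, stat, k = 199) {
  t0 <- stat(x)                          # step 3: T on the observed data
  tsim <- replicate(k, stat(rnull()))    # step 4: T on k simulated datasets
  K <- sum(tsim <= t0)                   # small T indicates clustering
  list(t0 = t0, tsim = tsim, p.value = (K + 1) / (k + 1))  # step 5
}

set.seed(1)
# Toy data (made up): two well separated groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# Assumed H_0: uniform distribution on the bounding box of the observed data
rng <- apply(x, 2, range)
rnull <- function() {
  apply(rng, 2, function(r) runif(nrow(x), r[1], r[2]))
}

# Assumed T: mean nearest-neighbour distance (small when points are clumped)
nn_stat <- function(y) {
  d <- as.matrix(dist(y))
  diag(d) <- Inf
  mean(apply(d, 1, min))
}

res <- mc_test(x, rnull, nn_stat, k = 199)
res$p.value
```

With k = 199 the smallest attainable p-value is 1/200, so k should be chosen
large enough for the significance level you have in mind.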

Standard choices for H_0 are a normal or a uniform distribution. (In
prabtest, it is a complicated distribution on presence-absence data.)
There are many possible choices of T. prabtest uses the ratio between
the 25% smallest distances in the dataset and the 25% largest distances.
This should be reasonable in fairly general settings. For a discussion of
this and alternative choices (and references on them), you may take a look
at

C. Hennig and B. Hausdorf: Distance-based parametric bootstrap tests for
clustering of species ranges, Computational Statistics and Data Analysis
45 (2004), 875-896.

A preprint of this can be obtained from my web page.
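
For general numeric data, a distance-ratio statistic of the kind mentioned
above could be sketched like this. The function name dist_ratio is made up,
and summarising the smallest and the largest distances by their means is an
assumption of this sketch; it is not meant to reproduce the prabtest code.

```r
# Ratio between the mean of the prop smallest and the mean of the prop
# largest pairwise distances (Euclidean distances via dist()).
# Using means to summarise both groups is an assumption of this sketch.
dist_ratio <- function(x, prop = 0.25) {
  d <- sort(as.vector(dist(x)))
  m <- max(1, floor(prop * length(d)))   # number of distances in each group
  mean(d[seq_len(m)]) / mean(d[seq(length(d) - m + 1, length(d))])
}

set.seed(2)
# Toy data (made up): two separated groups versus one homogeneous sample
clustered <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
homogeneous <- matrix(rnorm(80), ncol = 2)

dist_ratio(clustered)
dist_ratio(homogeneous)
```

The ratio lies between 0 and 1, and clearly separated groups push it towards
0, which matches the "T small" convention used in step 5.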

If you want to test the significance of a solution from a particular cluster
analysis method, you should think about choosing T so that it is somehow
connected to the method. (The Hennig and Hausdorf paper discusses, for
example, two alternatives that are connected to Single Linkage.)
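
As a purely hypothetical illustration of a statistic tied to a method (it is
not one of the alternatives from the paper), one could take the median merge
height of a single linkage dendrogram from hclust() divided by its largest
merge height. The name sl_height_ratio, the toy data, and the normal null
model fitted to the data are assumptions of this sketch.

```r
# Hypothetical statistic connected to single linkage: median merge height
# divided by the largest merge height. A small value means that the last
# merge bridges a gap that is wide compared with typical merges.
sl_height_ratio <- function(x) {
  h <- hclust(dist(x), method = "single")$height
  median(h) / max(h)
}

set.seed(3)
# Toy data (made up): two well separated groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

k <- 199
t0 <- sl_height_ratio(x)

# Assumed H_0: normal distribution with the mean and covariance of the data
mu <- colMeans(x)
R <- chol(cov(x))                        # t(R) %*% R equals cov(x)
tsim <- replicate(k, {
  z <- matrix(rnorm(length(x)), ncol = ncol(x)) %*% R
  sl_height_ratio(sweep(z, 2, mu, "+"))
})

# Small T indicates clustering, as in step 5
p <- (sum(tsim <= t0) + 1) / (k + 1)
p
```

A single outlier also produces a large last merge, so a statistic like this
cannot distinguish an isolated point from a separated group.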

Best,
Christian 

On Wed, 15 Sep 2004, Patrick Giraudoux wrote:

> Hi,
> 
> I am wondering if a Monte Carlo method (or equivalent) exists that permits testing the randomness of a cluster analysis (e.g. one obtained by
> hclust()). I went through the package "fpc" (maybe too superficially) but did not find such a method.
> 
> Thanks for any hint,
> 
> Patrick Giraudoux
> 

***********************************************************************
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag-online.de



