[R] what is used as height in hclust for ward linkage?

james.foadi at diamond.ac.uk
Fri Dec 2 16:03:58 CET 2011


Dear R community,
I am trying to understand how Ward linkage works from a quantitative point of view.
To test it I have devised a simple three-member set:

                           G = c(0,2,10)

The distances between all pairs are:

d(0,2)  =  2
d(0,10) = 10
d(2,10) =  8
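
These values can be checked directly in R. A minimal sketch, assuming the default Euclidean metric of the base dist function:

```r
G <- c(0, 2, 10)
# Pairwise Euclidean distances; rows and columns are indexed by position in G
D <- as.matrix(dist(G))
D[1, 2]  # d(0,2)
D[1, 3]  # d(0,10)
D[2, 3]  # d(2,10)
```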

The smallest distance corresponds to merging 0 and 2. The corresponding ESS are:

ESS(0,2) = 2*var(c(0,2)) = 4
ESS(0,10) = 2*var(c(0,10)) = 100
ESS(2,10) = 2*var(c(2,10)) = 64
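
The three values can be reproduced with a short check. For a pair, 2*var(x) is simply the squared distance between its two members. The helper name ess2 is my own, introduced only for illustration:

```r
G <- c(0, 2, 10)
# ess2() is a hypothetical helper: twice the sample variance, as used above
ess2 <- function(x) 2 * var(x)
pairs <- combn(G, 2)                 # each column is one candidate pair
vals <- apply(pairs, 2, ess2)
names(vals) <- apply(pairs, 2, paste, collapse = ",")
vals                                 # smallest entry marks the first merge
```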

and, indeed, the smallest ESS corresponds to merging 0 and 2. The only remaining step is to merge
10 with the cluster formed by 0 and 2. This is where I don't understand what the hclust algorithm in R is doing. We have:

> G <- c(0,2,10)
> G.dist <- dist(G)
> G.hc <- hclust(G.dist,method="ward")
> G.hc$merge
     [,1] [,2]
[1,]   -1   -2
[2,]   -3    1
> G.hc$height
[1]  2.00000 11.33333

Now, according to the standard definition, the distance between two clusters Rr and Rs containing Nr and Ns elements, respectively, is:

                          d(Rs,Rr) = sqrt(2*Nr*Ns/(Nr+Ns))*||<Rs> - <Rr>||

where < > in the last expression indicates averages (centroids). If I carry out this operation to merge cluster
c(0,2) with 10, I get:

                          d(c(0,2),10) = sqrt(2*2*1/(2+1))*|1-10| = 10.3923
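
To rule out a slip in the arithmetic, the formula can be evaluated in R. This is only a sketch of the expression written above, not of what hclust does internally. The function name ward_dist is hypothetical. The centroid of c(0,2) is 1 and the centroid of the singleton 10 is 10.

```r
# ward_dist() is a hypothetical helper implementing
# sqrt(2*Nr*Ns/(Nr+Ns)) * |<Rs> - <Rr>| for one-dimensional clusters
ward_dist <- function(r, s) {
  Nr <- length(r)
  Ns <- length(s)
  sqrt(2 * Nr * Ns / (Nr + Ns)) * abs(mean(s) - mean(r))
}

d_first  <- ward_dist(0, 2)         # two singletons: reduces to the plain distance
d_second <- ward_dist(c(0, 2), 10)  # cluster {0,2} against 10
d_first
d_second
```

For two singletons the formula returns the ordinary distance, which agrees with the first height reported by hclust (2.00000); the second value does not agree with 11.33333.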

This is different from 11.3333 in the R output.

Does anyone know exactly which quantity hclust reports as the height for Ward linkage?

Thanks in advance for any help!

J




