[Rd] hclust() and agnes() method="average" divergence (PR#3648)

Thu Aug 14 10:25:27 MEST 2003

>>>>> "MikG" == m grum <m.grum at cgiar.org>
>>>>>     on Mon, 4 Aug 2003 08:51:30 +0200 (MET DST) writes:

    MikG> Anyone have a clue why hclust() and agnes() produce
    MikG> different results in the example below when both use
    MikG> method="average"??  I'm not able to reproduce the
    MikG> problem with other datasets.

    MikG> ereck <- read.table("Ereck.txt",header=TRUE,sep="\t")
    MikG> emol <- subset(ereck,select=c(11:18,20:32))
    MikG> library(cluster)
    MikG> library(mva)
    MikG> daisemol <- daisy(emol,type=list(asymm=c(1:21)))

The reason is that the distances/dissimilarities are heavily tied:
there are only 20 different values among the 1326 distances.

> sort(table(daisemol), decreasing=TRUE)

starts as
>> 0.666666666666667               0.5               0.8 0.285714285714286 
>>               387               284               251                94 

i.e. the distance 2/3 appears 387 times, 1/2 appears 284 times, etc.
With so many ties in the distances, choosing the next
observation / cluster for "merging" often means choosing among many
equally good candidates. That choice is arbitrary, hence the
difference between the two algorithms.
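
To see the effect without the original Ereck.txt file, here is a
minimal sketch on made-up data. The toy matrix `x`, the seed, and the
choice of Manhattan distance on 0/1 columns are assumptions for
illustration only; they are not the poster's data or dissimilarity.
The sketch counts the distinct distance values and then compares the
two average-linkage trees through their cophenetic distances.

```r
library(cluster)   # agnes()
library(stats)     # dist(), hclust(), cophenetic() (formerly in package 'mva')

## Hypothetical toy data: 30 observations on 6 binary variables
set.seed(1)
x <- matrix(rbinom(30 * 6, 1, 0.5), ncol = 6)

## Manhattan distance on 0/1 data can only take the values 0, 1, ..., 6,
## so the 435 pairwise distances must contain many ties
d <- dist(x, method = "manhattan")
n.dist   <- length(d)
n.unique <- length(unique(d))
sort(table(d), decreasing = TRUE)

## Average linkage with both implementations
h <- hclust(d, method = "average")
a <- agnes(d, method = "average")

## Compare the two trees via their cophenetic distances;
## with ties present this need not return TRUE
same.tree <- isTRUE(all.equal(as.vector(cophenetic(h)),
                              as.vector(cophenetic(as.hclust(a)))))
same.tree
```

Whether `same.tree` comes out TRUE or FALSE depends on how each
implementation happens to resolve the ties, so either outcome is
consistent with the explanation above.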

For your situation, you might be able to use some continuous
variable along with the factors and the many binary ones such
that the distances won't have ties.
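
A rough sketch of that suggestion, again on made-up data: the data
frame `toy` and its continuous column `height` are hypothetical names,
not variables from the poster's data set. With a continuous variable
included, daisy() works with mixed variable types, and exact ties
among the dissimilarities become unlikely.

```r
library(cluster)   # daisy()

## Hypothetical toy data: 6 binary variables plus one continuous variable
set.seed(2)
toy <- as.data.frame(matrix(rbinom(30 * 6, 1, 0.5), ncol = 6))
toy$height <- rnorm(30)          # assumed continuous measurement

## Binary columns treated as asymmetric, as in the original call
d.bin <- daisy(toy[, 1:6, drop = FALSE], type = list(asymm = 1:6))
d.mix <- daisy(toy,                      type = list(asymm = 1:6))

## Number of distinct dissimilarity values without / with the continuous variable
n.bin <- length(unique(as.vector(d.bin)))
n.mix <- length(unique(as.vector(d.mix)))
c(binary.only = n.bin, with.continuous = n.mix)
```

Note that `d.bin` may contain NA when two rows are zero in all binary
variables, which is a separate reason to include further variables.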

NO bug! I.e., you should have posted this to R-help, not R-bugs
(you did have a good question!).

Regards,
Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><