[R] define number of clusters in kmeans/apcluster analysis

Ulrich Bodenhofer bodenhofer at bioinf.jku.at
Tue Dec 15 12:57:23 CET 2015


Dear Luigi,

As the others have replied already, you cannot expect a clustering 
algorithm to produce exactly the result that you expect intuitively. The 
results of clustering algorithms depend largely on the parameters and, 
even more importantly, on the distance/similarity measure that is used. 
k-means, for instance, uses the Euclidean distance. As a result, it 
works nicely for spherical clusters that have approximately the same 
radius. APCluster, unless you choose a different similarity, uses 
negative squared distances, which leads to very similar properties. Your 
data set consists of two clusters, one of which is much more spread out. 
That some parts of the larger cluster are being assigned to the other 
cluster looks weird, but it is perfectly explained by the properties of 
the algorithms. There is a lot of literature about the properties of 
clustering algorithms around. That's my 2 cents about this. In your 
case, however, as already pointed out in Bill Dunlap's reply, the 
scaling is the more important issue. k-means and apcluster do not 
perform any scaling of the data. Your two axes differ strongly in terms 
of scaling. Enter the following to see how the two clustering algorithms 
"see" your data (i.e. with two equally scaled axes):

     plot(z, xlim=c(0, 50), ylim=c(0, 50))

Given this, it is no longer surprising that both algorithms split the 
data in the way they do.
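
To put a number on the scaling effect, here is a minimal sketch in base R. It does not use your data: `z_demo` is a made-up stand-in with one wide axis and one narrow axis, and `sq_share` is a hypothetical helper written only for this illustration. It shows how much of the total squared Euclidean distance each column contributes before and after `scale()`.

```r
## Hypothetical stand-in for the poster's matrix 'z' (the real data are not
## reproduced here): column x spans a wide range, column y a narrow one.
set.seed(1)
z_demo <- cbind(x = c(rnorm(50, mean = 10, sd = 8), rnorm(50, mean = 40, sd = 2)),
                y = c(rnorm(50, mean = 1, sd = 0.3), rnorm(50, mean = 3, sd = 0.1)))

## Spread of each axis before and after scaling
sd_raw    <- apply(z_demo, 2, sd)
sd_scaled <- apply(scale(z_demo), 2, sd)   # scale() divides each column by its sd

## Share of the summed squared pairwise distances contributed by each column.
## Squared Euclidean distance is additive over columns, so the shares sum to 1.
sq_share <- function(m) {
  per_col <- sapply(seq_len(ncol(m)), function(j) sum(dist(m[, j])^2))
  names(per_col) <- colnames(m)
  per_col / sum(per_col)
}

share_raw    <- sq_share(z_demo)
share_scaled <- sq_share(scale(z_demo))
print(rbind(raw = share_raw, scaled = share_scaled))
```

On unscaled data of this kind, the wide axis accounts for almost all of the distance, so the narrow axis hardly influences the clustering. After scaling, both columns contribute equally.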

Actually, if you re-scale the data, apcluster produces the result you 
expect:

    library(apcluster)
    z2 <- scale(z)
    m <- apclusterK(negDistMat(r=2), z2, K=2, verbose=TRUE)
    plot(m, z2)
    ## it even works to superimpose the clustering result on the original data
    plot(m, z)
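
The same re-scaling step can be tried with k-means. The following is only a sketch on made-up data (`z_demo` is a hypothetical stand-in, not your `z`), and I have not checked what k-means returns on your actual data after scaling; since one of your clusters is much more spread out, the caveats above may still apply.

```r
## Hypothetical stand-in for the poster's matrix 'z'
set.seed(42)
z_demo <- cbind(x = c(rnorm(50, mean = 10, sd = 8), rnorm(50, mean = 40, sd = 2)),
                y = c(rnorm(50, mean = 1, sd = 0.3), rnorm(50, mean = 3, sd = 0.1)))

## Center each column and divide by its standard deviation
z2_demo <- scale(z_demo)

## k-means with two centers; nstart = 25 repeats the random initialization
## and keeps the best solution, which makes the result less start-dependent
km <- kmeans(z2_demo, centers = 2, nstart = 25)
print(table(km$cluster))

## Superimpose the cluster labels on the original (unscaled) data
plot(z_demo, col = km$cluster, pch = 19,
     main = "k-means on scaled data, shown on original axes")
```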

I hope that helps.

Best regards,
Ulrich


