[R] Trouble with (Very) Simple Clustering

Lorenzo Isella lorenzo.isella at gmail.com
Mon Jun 6 17:07:53 CEST 2016


Dear All,
I am doing something extremely basic (and I do not claim that there
is no other way to achieve the same thing): I have a list of numbers
and I would like to split them into clusters.
This is what I do: I treat each number as a 1D vector and calculate
the Euclidean distance between every pair of them.
This gives a distance matrix, which I then feed to a hierarchical
clustering algorithm.
For instance, consider the following snippet:


#########################################################
data_mat<-structure(c(50.1361524639595, 48.2314746179241, 30.3803078462882,
29.2679787220381, 25.5125237513957, 22.9052912406594,
21.3890604699407,
15.5680557012965, 15.322981489303, 8.36693180374788, 7.23530025890675,
6.51469907237986, 5.42861828441895, 4.61986804112007,
4.33660782487196,
3.89915821225882, 3.67394875259037, 2.32719820674605,
1.88489249113792,
1.62276579528843, 1.56048239182126, 1.49722163565454,
1.32492151010636,
1.28216249552147, 1.272235253501, 0.734274800585336,
0.326949583587343,
0.318777047947951), .Dim = c(28L, 1L), .Dimnames = list(c("EE",
"LV", "RO", "BG", "SK", "CY", "LT", "MT", "PL", "NL", "EL", "PT",
"CZ", "SE", "UK", "LU", "HR", "DK", "AT", "SI", "IE", "ES", "FI",
"FR", "DE", "IT", "HU", "BE"), NULL))



distMatrix <- dist(data_mat)

n_clus<-5 ## I arbitrarily choose to have 5 clusters

hc <- hclust(distMatrix , method="ward.D2")



groups <- cutree(hc, k=n_clus) # cut tree into 5 clusters

pdf("cluster1.pdf")
plot(hc, hang = -1, main="Mobility to Business",
     yaxt='n', ann=FALSE)
rect.hclust(hc, k=n_clus, border="red") # outline the 5 clusters
dev.off()

######################################################

which gives me very reasonable results.
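
To make "reasonable" a bit more concrete, here is a minimal sketch of how
the membership of each cluster can be listed from the cutree() output. The
values and labels below are made up for illustration (they are not my real
data), and the toy_* names are hypothetical:

```r
## Toy stand-in for data_mat: six made-up values with made-up labels
toy <- matrix(c(50, 48, 30, 29, 5, 4), ncol = 1,
              dimnames = list(c("A", "B", "C", "D", "E", "F"), NULL))

toy_hc <- hclust(dist(toy), method = "ward.D2")
toy_groups <- cutree(toy_hc, k = 3)   # named integer vector: label -> cluster id

## List the labels that fall into each cluster
toy_members <- split(names(toy_groups), toy_groups)
print(toy_members)

## Range of values covered by each cluster
toy_ranges <- tapply(toy[, 1], toy_groups, range)
print(toy_ranges)
```

The same two calls, split() and tapply(), should work unchanged on groups
and data_mat[, 1] from the snippet above.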

Now I would like to find the optimal number of clusters for the
same data.
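
As a point of comparison only (this is not pvclust), one common heuristic
is to pick the k that maximises the average silhouette width. A minimal
sketch, assuming the cluster package is installed and using made-up toy
values rather than my data; the toy and best_k names are hypothetical:

```r
library(cluster)  # provides silhouette(); assumed to be installed

## Toy stand-in for data_mat: six made-up values with made-up labels
toy <- matrix(c(50, 48, 30, 29, 5, 4), ncol = 1,
              dimnames = list(c("A", "B", "C", "D", "E", "F"), NULL))

d  <- dist(toy)
hc <- hclust(d, method = "ward.D2")

## Average silhouette width for each candidate number of clusters
ks <- 2:5
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

## Candidate k with the largest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = avg_sil))
print(best_k)
```

This only ranks the cuts of one fixed tree; it does not give the
bootstrap p-values that pvclust is meant to provide.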

Based on what I found at

http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/

http://www.statmethods.net/advstats/cluster.html

pvclust seems to be a sensible way to go. However, when I try to use
it on my data, I get an error:

> fit <- pvclust(t(data_mat),
+                method.hclust="ward.D2", method.dist="euclidean")
Error in FUN(X[[i]], ...) : invalid scale parameter(r)


Does anybody understand what my mistake is?
Many thanks,

Lorenzo


