[R] EM unsupervised clustering

Wed Jul 18 15:37:36 CEST 2007

Hi All,

I have an n x m matrix. The n rows are individuals, the m columns are variables.

The matrix itself is a collection of 1s (if a variable is observed for an 
individual) and 0s (if there is no observation).

Something like:

      [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    0    1    1    0    0
[2,]    1    0    1    1    0    0
[3,]    1    0    1    1    0    0
[4,]    0    1    0    0    0    0
[5,]    1    0    1    1    0    0
[6,]    0    1    0    0    1    0
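
For reproducibility, the six rows shown above can be entered directly. This is 
only a sketch of the excerpt, not necessarily the real matrix; the object name 
`data` is chosen to match the calls further down.

```r
# Rebuild the 6 x 6 excerpt shown above (1 = observed, 0 = not observed).
# Only the displayed rows are reproduced; the real matrix may differ.
data <- matrix(c(1, 0, 1, 1, 0, 0,
                 1, 0, 1, 1, 0, 0,
                 1, 0, 1, 1, 0, 0,
                 0, 1, 0, 0, 0, 0,
                 1, 0, 1, 1, 0, 0,
                 0, 1, 0, 0, 1, 0),
               nrow = 6, byrow = TRUE)

print(data)
```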

I use kmeans to find 2 or 3 clusters in this matrix:

k2 = kmeans(data, centers = 2, iter.max = 10000000)
k3 = kmeans(data, centers = 3, iter.max = 10000000)

but I would like to use something a bit more refined, so I thought about an 
EM-based clustering. I am using the Mclust() function from the mclust package, 
but I get the following (to me incomprehensible) error message:

plot(Mclust(as.data.frame(data)), as.data.frame(data))
Hit <Return> to see next plot:
Hit <Return> to see next plot:
Hit <Return> to see next plot:
Error in 1:L : NA/NaN argument
In addition: Warning messages:
1: best model occurs at the min or max # of components considered in: 
summary.mclustBIC(Bic, data, G = G, modelNames = modelNames)
2: optimal number of clusters occurs at min choice in: 
Mclust(as.data.frame(anc.st.mat))
3: insufficient input for specified plot in: coordProj(data = data, parameters = 
x$parameters, z = x$z, what = "classification",

That's puzzling because the example given by ?Mclust is something like

plot(Mclust(iris[,-5]), iris[,-5])

which is pretty simple and dumbproof and works flawlessly...
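
One way to compare the two inputs, sketched here without claiming that it 
explains the error, is to look at the per-column variance and at the number of 
distinct rows. The names `excerpt`, `r1`, `r4` and `r6` are hypothetical and 
stand only for the six rows shown earlier, not for the real matrix.

```r
# Hypothetical stand-in for the six rows shown earlier in this message.
r1 <- c(1, 0, 1, 1, 0, 0)   # rows 1, 2, 3 and 5 are identical
r4 <- c(0, 1, 0, 0, 0, 0)
r6 <- c(0, 1, 0, 0, 1, 0)
excerpt <- rbind(r1, r1, r1, r4, r1, r6)

# Variance of each column; a value of 0 means the column is constant.
col_var <- apply(excerpt, 2, var)
constant_cols <- which(col_var == 0)

# Number of distinct rows in the excerpt and in the ?Mclust example data.
n_distinct_excerpt <- nrow(unique(excerpt))
n_distinct_iris <- nrow(unique(iris[, -5]))

print(col_var)
print(constant_cols)
print(c(excerpt = n_distinct_excerpt, iris = n_distinct_iris))
```

In the excerpt, column 6 is constant and only three distinct rows occur, 
whereas iris[,-5] holds continuous measurements. Whether that difference has 
anything to do with the error is not established here.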

best,

Federico

-- 
Federico C. F. Calboli
Department of Epidemiology and Public Health
Imperial College, St Mary's Campus
Norfolk Place, London W2 1PG

Tel  +44 (0)20 7594 1602     Fax (+44) 020 7594 3193

f.calboli [.a.t] imperial.ac.uk
f.calboli [.a.t] gmail.com