[R] K-means recluster data with given cluster centers

t.peter.Mueller at gmx.net t.peter.Mueller at gmx.net
Mon Jan 11 13:19:32 CET 2010



Dear R users,

I have several large data sets, and additional data sets will be created over time.
I want to cluster all of the data in a similar or identical way with the k-means algorithm.

With the first data set I find my cluster centers and save them to a file [1].
This first data set is huge, so the cluster centers are guaranteed to converge.

Afterwards I load the saved cluster centers and use k-means to cluster all other data sets with those same centers [2].

I tried this, but the reclustering step now fails with the following error message:
"Error: empty cluster: try a better set of initial centers"

An empty cluster (one with no data points) should not be a problem. It can happen because the new data sets can be smaller.
What am I doing wrong? Is there another way to cluster new data in the same way as the old data sets?
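
For illustration, the error message can probably be reproduced on a toy example like the one below. This is only a sketch: toy_data and toy_centers are made-up names and values, not the data sets from this post. The assumption is that a given center which is nearest to none of the data points leaves its cluster empty, and that kmeans() with its default algorithm treats this as an error.

```r
# Toy reproduction (hypothetical data, not the data sets from this post)
set.seed(1)
toy_data <- matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2)  # 10 points near the origin
toy_centers <- rbind(c(0, 0), c(100, 100))                   # second center is far from every point

# No point is closest to the second center, so its cluster stays empty.
# tryCatch() captures the error message instead of aborting the script.
result <- tryCatch(
  kmeans(toy_data, centers = toy_centers, iter.max = 1),
  error = function(e) conditionMessage(e)
)
print(result)
```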

Thanks
Peter


1: R code to find the cluster centers and save them to a file
   #---INITIAL CLUSTERING TO FIND CLUSTER CENTERS
   # LOAD LIB (not strictly needed: kmeans() is in the stats package, attached by default)
   library(cluster)

   # LOAD DATA
   data_unclean <- read.table("dataset1.dat")
   data.matrix <- as.matrix(data_unclean) # convert the data frame to a matrix

   # CLUSTER
   Nclust <- 100 # number of cluster centers
   Imax <- 200 # maximum number of iterations allowed for convergence
   set.seed(100) # set seed of the random number generator
   init <- sample(nrow(data.matrix), Nclust) # row indices of the Nclust initial prototypes
   km <- kmeans(data.matrix, centers=data.matrix[init,], iter.max=Imax)

   # WRITE OUT CLUSTER CENTERS
   km$centers # print cluster centers (columns: dimensions; rows: clusters)
   km$size # print number of data points in each cluster
   clusterCenters <- km$centers
   save(file="clusterCenters.RData", list='clusterCenters') # example
   write.table(km$centers, file = "clusterCenters.dat", sep = ",", col.names= FALSE, row.names= FALSE)


2: R code to recluster new data
   #---RECLUSTER NEW DATA WITH GIVEN CLUSTER CENTERS
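   # LOAD LIB, SET PARAMETERS
   library(cluster) # not strictly needed: kmeans() is in the stats package
   loopStart <- 0 # index of the first data set
   loopEnd <- 10 # index of the last data set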

   # LOAD CLUSTER CENTER
   load("clusterCenters.RData") # load cluster centers

   # LOOP OVER TRAJ AND RECLUSTER THEM
   for(ii in loopStart:loopEnd){
        # DEFINE FILENAME
        #print(paste("test",ii,sep=""))
        filenameInput <- paste("dataset",ii,".dat",sep="") # e.g. "dataset1.dat", as in [1]
        filenameOutput <- paste("dataset",ii,".datClusters",sep="")
        print(filenameInput)
        print(filenameOutput)

        # LOAD DATA
        data_unclean <- read.table(filenameInput)
        data.matrix <- as.matrix(data_unclean) # convert the data frame to a matrix

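        # RECLUSTER DATA
        # (this kmeans() call is the step that stops with the "empty cluster" error)
        kmRecluster <- kmeans(data.matrix, centers=clusterCenters, iter.max=1)
        print(kmRecluster$size) # explicit print(): values are not auto-printed inside a for loop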

        # WRITE OUT CLUSTERS FOR EACH DATA
        write.table(kmRecluster$cluster, file = filenameOutput, sep = ",", col.names= FALSE, row.names= FALSE)
   }
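
One possible alternative to the reclustering step in [2], sketched here under assumptions rather than taken from the post: if the saved centers are meant to stay fixed, each new data point can simply be assigned to its nearest center without calling kmeans() at all. The helper assign_to_centers and the toy values are hypothetical. Squared Euclidean distance is assumed, which matches the distance k-means minimizes. Empty clusters then just show up with size 0.

```r
# Assign each row of new_data to its nearest center (squared Euclidean distance).
# assign_to_centers is a hypothetical helper, not part of any package.
assign_to_centers <- function(new_data, centers) {
  new_data <- as.matrix(new_data)
  centers  <- as.matrix(centers)
  stopifnot(ncol(new_data) == ncol(centers))
  # ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * x.c, computed for all pairs at once
  d2 <- outer(rowSums(new_data^2), rowSums(centers^2), "+") -
        2 * new_data %*% t(centers)
  # index of the closest center for every data point
  apply(d2, 1, which.min)
}

# Toy example (made-up values): the third center attracts no points
toy_centers  <- rbind(c(0, 0), c(10, 10), c(100, 100))
toy_data     <- rbind(c(0.1, -0.2), c(9.5, 10.3), c(0.3, 0.1))
cluster_id   <- assign_to_centers(toy_data, toy_centers)
cluster_size <- tabulate(cluster_id, nbins = nrow(toy_centers))
print(cluster_id)
print(cluster_size)
```

In the loop of [2], such a helper could take the place of the kmeans() call, with data.matrix and clusterCenters as arguments. The centers are never updated, so every data set is labeled against exactly the same prototypes.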




More information about the R-help mailing list