[R] Clustering Large Applications..sort of

Christian Hennig chrish at stats.ucl.ac.uk
Thu Aug 11 01:13:28 CEST 2011


There are a number of methods in the literature for deciding the number of
clusters for k-means. Probably the most popular one is the Calinski and
Harabasz index, implemented as calinhara in package fpc. A distance-based
version (and several other indexes for this purpose) is available in
function cluster.stats in the same package.
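
To illustrate the idea, here is a minimal sketch that picks k by maximising
the Calinski and Harabasz index over a range of candidate values. It uses
base R only and computes the index directly from the kmeans output as
(between SS / (k - 1)) / (within SS / (n - k)); as far as I know,
calinhara(x, km$cluster) from fpc should give the same value for such a
partition. The toy data, the helper name ch_index, and the range 2:6 are
assumptions made up for this example, not part of any package.

```r
# Sketch: choose the number of clusters for k-means by maximising the
# Calinski-Harabasz index. Toy data and candidate range are assumptions.
set.seed(1)

# Three well-separated groups of 50 points each in two dimensions.
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

# Hypothetical helper: CH index from a fitted kmeans object and sample size n.
ch_index <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

# Fit k-means for each candidate k (the index is undefined for k = 1).
krange <- 2:6
ch <- sapply(krange, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  ch_index(km, nrow(x))
})

# The candidate k with the largest index is the suggested number of clusters.
best_k <- krange[which.max(ch)]
print(data.frame(k = krange, ch = ch))
print(best_k)
```

For large data sets the loop costs one k-means fit per candidate k, so the
range of k values is the main thing to keep small.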

Christian

On Wed, 10 Aug 2011, Ken Hutchison wrote:

> Hello all,
>   I am using the clustering functions in R to work with large masses of
> binary time series data, but they do not seem able to handle a practical
> problem of this size. Library 'hclust' is good in that I do not want to
> make assumptions about the number of clusters present, but it may be
> subpar for a problem of this size, and given the computational resources
> and time it needs, it is not good enough in practice. k-means works fine
> if the number of clusters in the data is assumed known, which is not
> realistic. The silhouette functions in 'Pam' and 'Clara' and (if I
> remember correctly) 'cluster' performed really badly in very thorough
> experiments on generated data with known clusters. I am left, then, with
> either theoretical abstractions such as pruning hclust trees with minimal
> spanning trees, or perhaps hand-rolling a hierarchical k-medoids that
> works extremely efficiently and without assumptions about the number of
> clusters. Does anybody have suggestions for libraries I have missed, or
> suggestions in general? Note: this is not a question for 'Bigkmeans'
> unless there also exists a 'findbigkmeansnumberofclusters' function.
>                                        Thank you in advance for your
> assistance,
>                                             Ken

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


