[R] finding a stable cluster for kmeans

Wiebke Timm wtimm at techfak.uni-bielefeld.de
Wed Sep 26 16:58:34 CEST 2007


On 26.09.2007, at 09:54, Prof Brian Ripley wrote:

> On Wed, 26 Sep 2007, Friedrich Leisch wrote:
>
>>>>>>> On Tue, 25 Sep 2007 20:16:05 -0400,
>>>>>>> Wiebke Timm (WT) wrote:
>>
>>  > You might want to check if there is a neural gas algorithm in R.
>>  > kmeans generally has a high variance since it is very dependent on
>>  > the initialization. Neural gas overcomes this problem by using a
>>  > ranked list of neighbouring data points instead of using data
>>  > directly. It is more stable (at the cost of additional  
>> computational
>>  > time).
>>
>> Neural gas is in package flexclust on CRAN (one of the clustering
>> methods that function cclust() provides).
>>
>> I also find it more stable than kmeans for some data, although in
>> general I agree with what has been said before in this thread:
>> instability is in most cases caused by a lack of clear cluster
>> structure in the data, the wrong number of clusters, etc., rather
>> than by the wrong clustering algorithm.

> I don't understand this use of 'high variance' and 'stable'.

high variance: Each time you run k-means (with the same number of
prototypes, the same data, and random prototype placement), the
result's goodness of fit varies.

stable: Results are reproducible.
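A small, hypothetical R sketch of what I mean (simulated data, made up
for illustration; the exact numbers will vary):

```r
set.seed(1)
## Simulated data: three loose groups, deliberately not crisp.
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

## 20 runs of kmeans(), each with a single random start, asking for
## more clusters than the data really supports.
fits <- replicate(20, kmeans(x, centers = 6)$tot.withinss)
range(fits)  # the total within-cluster SS typically varies across runs
```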

> K-means is a clearly defined criterion (rare in the clustering  
> field) and so the outcome does not depend on the initialization.

I suggest having a look at
http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2002/42/i06/abs/ci020270w.html

"The K-means method is usually applied for partitioning
data into k clusters. MacQueen’s K-means seems to be the
most popular. It is known that K-means suffers from the
initial random selection of cluster centers, which has a crucial
impact upon the final results.1 Moreover, the convergence
to an optimal solution of the cost function is not guaranteed.1"

And there are more sources which state that it does depend. Vogt et
al. state in http://www.clinchem.org/cgi/content/abstract/38/2/182 ,
referring to the Hartigan algorithm:

"3. The result of the classification depends on the number of groups  
chosen, the starting partition, and the order of objects."

I know of other sources that made this observation.

With k-means, even with the right number of prototypes, it can happen
that, due to a very unlucky initialization of the prototypes, one
cluster gets two prototypes while another cluster gets none.
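To make this concrete, a hypothetical sketch (simulated data; I use
the classical Lloyd variant of kmeans() here, which cannot reassign
single points to escape such a start):

```r
set.seed(1)
## Three well-separated groups around 0, 5 and 10.
x <- matrix(c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 10)), ncol = 1)

## Deliberately bad start: two prototypes inside the first group,
## none near the third.
bad.start <- matrix(c(-0.5, 0.5, 5), ncol = 1)
fit <- kmeans(x, centers = bad.start, algorithm = "Lloyd")
fit$centers  # the first group typically ends up split, the last two merged
```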

> Maybe different runs of kmeans() give different clusters, but in  
> that case the algorithm is not optimizing the criterion in some  
> (and maybe all) cases.  ?kmeans clearly says
>
>      The Hartigan-Wong algorithm
>      generally does a better job than either of those, but trying
>      several random starts is often recommended.

Yes, and this is because it depends on the initialization, i.e. the
initial placement of the prototypes.

IMO it is more convenient to have an algorithm that you only need to
run once, because you know it will reproduce its result (neural gas),
than one you have to run 50 times and then pick the best result among
many different outcomes (k-means).
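For completeness, a sketch of the repeated-run workaround in R itself:
kmeans() can do the restarts internally via its 'nstart' argument and
keep the fit with the smallest total within-cluster sum of squares
(simulated data, for illustration only):

```r
set.seed(42)
x <- matrix(rnorm(400), ncol = 2)

one  <- kmeans(x, centers = 5, nstart = 1)   # a single random start
many <- kmeans(x, centers = 5, nstart = 50)  # best of 50 random starts

## 'many' keeps whichever start minimised tot.withinss, so it is
## usually at least as good a fit as a single random start.
c(one$tot.withinss, many$tot.withinss)
```

But that still gives you a "best of n" answer rather than a
reproducible one.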

> And that there are several clusterings with roughly equally good  
> fit would indicate that none of them is a uniquely good summary of  
> the data.

Yes, but the problem is that you get fits of different quality each
time you rerun it. That's what I meant by "not stable".

Regards,
   Wiebke


