[R] cluster analysis and supervised classification: an alternative to knn1?

Christian Hennig chrish at stats.ucl.ac.uk
Thu May 27 13:09:37 CEST 2010

Dear abanero,

In principle, k nearest neighbours classification can be computed on 
any dissimilarity matrix. Unfortunately, knn and knn1 seem to assume 
Euclidean vectors as input, which restricts their use.

I'd probably compute an appropriate dissimilarity between points (have a 
look at Gower's distance in daisy, package cluster), and then implement 
nearest neighbours classification myself if I needed it. It should be 
pretty straightforward to implement.
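
A minimal sketch of the idea above, written in Python rather than R and
using only the standard library. It assumes a simplified Gower-style
dissimilarity (range-scaled absolute difference for numeric variables,
simple mismatch for categorical ones, no missing values); it is not the
daisy implementation. The names gower_dissimilarity and knn_predict and
the toy data are hypothetical.

```python
from collections import Counter


def gower_dissimilarity(a, b, ranges):
    """Simplified Gower-style dissimilarity between two mixed-type records.

    ranges[j] is the range (max - min) of variable j if it is numeric,
    or None if it is categorical. Assumption: no missing values.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical variable: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numeric variable: absolute difference scaled by the range.
            total += abs(x - y) / r
    return total / len(ranges)


def knn_predict(dissims, labels, k=1):
    """Classify one new point from its dissimilarities to the training points."""
    # Indices of the k training points with the smallest dissimilarity.
    nearest = sorted(range(len(labels)), key=lambda i: dissims[i])[:k]
    # Majority vote among their labels.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training data: one numeric and one categorical variable.
train = [(1.0, "red"), (1.5, "red"), (9.0, "blue"), (8.0, "blue")]
labels = ["A", "A", "B", "B"]
ranges = [8.0, None]  # numeric range 9.0 - 1.0; second variable is categorical

new_point = (2.0, "red")
d = [gower_dissimilarity(new_point, t, ranges) for t in train]
prediction = knn_predict(d, labels, k=3)
print(prediction)  # A
```

Only the vector of dissimilarities enters the classifier, so any other
dissimilarity measure could be substituted without changing knn_predict.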

If you want unsupervised classification (clustering) instead, you can then 
choose among all kinds of dissimilarity-based algorithms (hclust, pam, 
agnes etc.).
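
To illustrate what "dissimilarity-based" means here, the following is a
small Python sketch of agglomerative clustering with complete linkage that
takes nothing but a dissimilarity matrix as input. It is a naive
illustration under stated assumptions, not the algorithm as implemented in
hclust or agnes; the function name complete_linkage and the toy matrix D
are hypothetical.

```python
def complete_linkage(D, n_clusters):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix D.

    Starts with every point in its own cluster and repeatedly merges the two
    clusters whose largest pairwise dissimilarity is smallest, until
    n_clusters remain. Returns a list of clusters (lists of point indices).
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # dissimilarity between any pair of members.
                d = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy dissimilarity matrix for four points (could come from any measure).
D = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
groups = complete_linkage(D, 2)
print(groups)  # [[0, 1], [2, 3]]
```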


On Thu, 27 May 2010, Ulrich Bodenhofer wrote:

> abanero wrote:
>> Do you know something like “knn1” that works with categorical variables
>> too?
>> Do you have any suggestion? 
> There are surely plenty of clustering algorithms around that do not require
> a vector space structure on the inputs (like KNN does). I think
> agglomerative clustering would solve the problem as well as a kernel-based
> clustering (assuming that you have a positive semi-definite measure
> of the similarity of two samples). Probably the simplest way is Affinity
> Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation;
> see CRAN package "apcluster" I have co-developed). All you need is a way of
> measuring the similarity of samples which is straightforward both for
> numerical and categorical variables - as well as for mixtures of both (the
> choice of the similarity measures and how to aggregate the different
> variables is left to you, of course). Your final "classification" task can
> be accomplished simply by assigning the new sample to the cluster whose
> exemplar is most similar.
>
> Joris Meys wrote:
>> Not a direct answer, but from your description it looks like you are
>> better off with supervised classification algorithms instead of
>> unsupervised clustering.
> If you say that this is a purely supervised task that can be solved without
> clustering, I disagree. abanero does not mention any class labels. So it
> seems to me that it is indeed necessary to do unsupervised clustering first.
> However, I agree that the second task of assigning new samples to
> clusters/classes/whatever can also be solved by almost any supervised
> technique if samples are labeled according to their cluster membership
> first.
> Cheers, Ulrich

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
