[R] Cluster analysis with numeric and categorical variables

Christian Hennig chrish at stats.ucl.ac.uk
Tue Jun 3 13:58:40 CEST 2008


Dear Miha,

a general way to do this is as follows:
Define a distance measure by aggregating the 
Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if 
the category is the same, 1 otherwise) on the categorical variable. Perform 
cluster analysis (whichever method you want) on the resulting distance matrix.
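
A minimal sketch of this recipe in base R. The toy data frame d, the column 
names X, Y and cat, the weight w and the choice of average linkage are all 
illustrative assumptions, not something taken from the original question:

```r
# Toy data; column names and values are made up for illustration.
d <- data.frame(
  X   = c(0.0, 0.2, 0.1, 5.0, 5.2, 5.1),
  Y   = c(0.0, 0.1, 0.3, 5.0, 4.9, 5.2),
  cat = factor(c("a", "a", "b", "b", "b", "a"))
)

# Euclidean distance on the (X,Y)-space.
d_xy <- dist(d[, c("X", "Y")])

# Trivial 0-1 distance on the categorical variable:
# 0 if two observations share a category, 1 otherwise.
ch <- as.character(d$cat)
d_cat <- as.dist(outer(ch, ch, FUN = "!=") * 1)

# Aggregate the two distances; w is an assumed weight for the
# categorical part and has to be chosen by the analyst.
w <- 1
d_mix <- d_xy + w * d_cat

# Any method that accepts a distance matrix can be used from here on;
# average-linkage hierarchical clustering is just one possibility.
hc <- hclust(d_mix, method = "average")
cl <- cutree(hc, k = 2)
```

With larger w, observations in different categories are pushed further 
apart, so the categorical variable dominates the clustering more strongly.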

Note that there is more than one way to do this. The 0-1 distance could be 
incorporated in the definition of the Euclidean distance (in place of a 
squared difference (x_i-y_i)^2), or a weighted average of the distances in 
X-, Y- and categorical space could be computed. The weights of the variables 
(possibly including rescaling) have to be decided. How to do this precisely 
should depend on the subject matter and prior information about variable 
importance etc. In the absence of such information, you may standardise the 
variable-wise sums of squared pairwise distances to be equal.
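
One possible reading of this standardisation, sketched in base R. The toy 
data, the helper name standardise and the target value 1 for each sum of 
squares are assumptions made for illustration only:

```r
# Toy data; column names and values are made up for illustration.
d <- data.frame(
  X   = c(0.0, 0.2, 0.1, 5.0, 5.2, 5.1),
  Y   = c(0.0, 0.1, 0.3, 5.0, 4.9, 5.2),
  cat = factor(c("a", "a", "b", "b", "b", "a"))
)

# One distance object per variable: absolute differences for X and Y,
# the 0-1 distance for the categorical variable.
ch <- as.character(d$cat)
per_var <- list(
  X   = dist(d$X),
  Y   = dist(d$Y),
  cat = as.dist(outer(ch, ch, FUN = "!=") * 1)
)

# Rescale a distance object so that its sum of squared pairwise
# distances equals 1 (this fails for a variable that is constant).
standardise <- function(dv) dv / sqrt(sum(dv^2))
per_var_std <- lapply(per_var, standardise)

# Euclidean-type aggregation: the squared 0-1 distance takes the
# place of a squared difference (x_i - y_i)^2.
d_euc <- sqrt(Reduce(`+`, lapply(per_var_std, function(dv) dv^2)))

# Alternative: weighted average of the standardised distances;
# the equal weights here are an arbitrary assumption.
wts <- c(X = 1, Y = 1, cat = 1)
d_avg <- Reduce(`+`, Map(function(dv, wt) wt * dv, per_var_std, wts)) / sum(wts)
```

Either d_euc or d_avg can then be passed to a clustering function that 
takes a distance matrix, such as hclust().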

Hope this helps (and you can figure out the relevant R code yourself).

Christian

On Tue, 3 Jun 2008, Miha Staut wrote:

> Dear all,
>
> I would like to perform a cluster analysis on a data frame with two coordinate variables (X and Y) and a categorical variable for which only a != b can be established. As far as I understand classification methods, they are not an option, as they only assign the test set to the k classes of the training set. Searching through the book "Modern Applied Statistics with S", I did not find a satisfactory solution.
>
> I will be grateful for any suggestions.
>
> Best regards
> Miha
>

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


