[R] Cluster analysis with numeric and categorical variables
chrish at stats.ucl.ac.uk
Tue Jun 3 13:58:40 CEST 2008
A general way to do this is as follows:
Define a distance measure by aggregating the
Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if
the category is the same, 1 otherwise) on the categorical variable. Perform
cluster analysis (by whichever method you prefer) on the resulting distance
matrix.
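A minimal sketch of this recipe in R, using only base functions (dist, as.dist, hclust, cutree). The toy data, the column names X, Y and G, the weight w, the simple additive aggregation and the choice of average linkage are all assumptions made for illustration, not recommendations:

```r
# Hypothetical toy data: two coordinates and one nominal variable.
set.seed(1)
dat <- data.frame(
  X = c(rnorm(10, 0), rnorm(10, 5)),
  Y = c(rnorm(10, 0), rnorm(10, 5)),
  G = factor(sample(c("a", "b", "c"), 20, replace = TRUE))
)

# Euclidean distance on the (X, Y)-space.
d_xy <- dist(dat[, c("X", "Y")])

# Trivial 0-1 distance on the category: 0 if equal, 1 otherwise.
g <- as.character(dat$G)
d_cat <- as.dist(outer(g, g, "!=") * 1)

# Aggregate the two distances; w is an assumed weight for the category.
w <- 1
d_mix <- d_xy + w * d_cat

# Any method that accepts a distance matrix can be used from here on;
# average-linkage hierarchical clustering is just one choice.
hc <- hclust(d_mix, method = "average")
cl <- cutree(hc, k = 2)
table(cl, dat$G)
```

Changing w shifts the balance between spatial closeness and agreement in category, so it is worth trying a few values.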
Note that there is more than one way to do this. The 0-1 distance could be
incorporated in the definition of the Euclidean distance (taking the place of
a squared difference (x_i-y_i)^2), or a weighted average of the distances in
X-, Y- and categorical space could be computed. Weights of the variables
(possibly including rescaling) have to be decided. How to do this precisely
should depend on the subject matter and prior information about variable
importance etc. In the absence of such information, you may standardise the
variablewise sums of squared pairwise distances to be equal.
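One possible reading of this default standardisation, sketched in base R: each variable's distances are rescaled so that its squared pairwise distances sum to the number of pairs, and the 0-1 distance then enters a Euclidean-type aggregate like any other coordinate. The toy data, the helper name std_sq and the common target value are assumptions made for illustration:

```r
# Hypothetical toy data with X and Y on very different scales.
set.seed(2)
dat <- data.frame(
  X = runif(30, 0, 100),
  Y = runif(30, 0, 1),
  G = factor(sample(c("a", "b", "c"), 30, replace = TRUE))
)

# One distance per variable.
d_x <- dist(dat$X)
d_y <- dist(dat$Y)
g <- as.character(dat$G)
d_g <- as.dist(outer(g, g, "!=") * 1)

# Rescale so that the mean squared pairwise distance is 1, i.e. the sum
# of squared pairwise distances equals the number of pairs.
# (Undefined if all distances are 0, e.g. a constant variable.)
std_sq <- function(d) d / sqrt(mean(as.vector(d)^2))

parts <- lapply(list(X = d_x, Y = d_y, G = d_g), std_sq)

# Sums of squared pairwise distances are now equal across variables.
ss <- sapply(parts, function(d) sum(as.vector(d)^2))
print(ss)

# Euclidean-type aggregation with the 0-1 distance treated as a coordinate.
d_std <- sqrt(parts$X^2 + parts$Y^2 + parts$G^2)
hc <- hclust(d_std, method = "average")
```

Unequal weights can be introduced afterwards by multiplying the squared terms by factors that reflect subject-matter knowledge.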
Hope this helps (and you can figure out the relevant R code yourself).
On Tue, 3 Jun 2008, Miha Staut wrote:
> Dear all,
> I would like to perform a clustering analysis on a data frame with two coordinate variables (X and Y) and a categorical variable where only a != b can be established. As far as I understood classification analyses, they are not an option as they partition the training set only in k classes of the test set. By searching through the book "Modern Applied Statistics with S" I did not find a satisfactory solution.
> I will be grateful for any suggestions.
> Best regards
*** --- ***
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche