[R] Cluster analysis, factor variables, large data set

Peter Langfelder peter.langfelder at gmail.com
Thu Mar 31 21:22:54 CEST 2011


On Thu, Mar 31, 2011 at 11:48 AM, Hans Ekbrand <hans at sociologi.cjb.net> wrote:
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employee"
> ...
>
> Do Euclidean distances make sense on unordered factors coded as
> integers?

It probably doesn't. You said you have some 36 observations for each
case, correct? You can turn these 36 observations into a vector of
length 36 * 9 on which Euclidean distance does make sense: two cases
that differ in k observations end up at a distance of sqrt(2*k). For
each observation with value p (p between 1 and 9), create a vector
r = c(0,0,1,0,...,0) of length 9 whose single entry 1 is in the p-th
component. Hence, if values p1 and p2 are the same, the Euclidean
distance between r1 and r2 is zero; if they are not the same, the
Euclidean distance is sqrt(2).
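
A quick way to convince yourself of the sqrt(2*k) claim is the small
check below. It is only a sketch: the helper name oneHot and the two
made-up cases (3 observations each instead of 36) are mine, purely for
illustration.

```r
# Hypothetical helper: indicator vector of length maxVal with a 1 in position p
oneHot = function(p, maxVal) as.numeric(seq_len(maxVal) == p)

# Two made-up cases with 3 observations each, values between 1 and 9
case1 = c(1, 3, 4)
case2 = c(1, 2, 9)   # differs from case1 in the 2nd and 3rd observation

# Concatenate the indicator vectors of all observations of a case
r1 = unlist(lapply(case1, oneHot, maxVal = 9))
r2 = unlist(lapply(case2, oneHot, maxVal = 9))

k = sum(case1 != case2)          # number of observations that differ
d = sqrt(sum((r1 - r2)^2))       # Euclidean distance of the expanded vectors

print(c(distance = d, expected = sqrt(2 * k)))
```

Each differing observation contributes two non-zero entries (a 1 and
a -1) to r1 - r2, which is where the factor 2 under the square root
comes from.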

Here's some possible R code:


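# Note: this definition masks base R's transform() for the session.
transform = function(obsVector, maxVal)
{
  # Identity matrix: column p is the indicator vector for value p
  templateMat = matrix(0, maxVal, maxVal);
  diag(templateMat) = 1;

  # Pick one column per observation and stack them into a single vector
  return(as.vector(templateMat[, obsVector]));
}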

set.seed(10)
n = 4;
m = 5;
max = 4;
data = matrix(sample(c(1:max), n*m, replace = TRUE), m, n);

> data
     [,1] [,2] [,3] [,4]
[1,]    3    3    1    2
[2,]    1    3    3    2
[3,]    3    3    2    4
[4,]    1    2    4    2
[5,]    4    1    4    1


trafoData = apply(data, 2, transform, maxVal = max);

> trafoData
      [,1] [,2] [,3] [,4]
 [1,]    0    0    1    0
 [2,]    0    0    0    1
 [3,]    1    1    0    0
 [4,]    0    0    0    0
 [5,]    1    0    0    0
 [6,]    0    0    0    1
 [7,]    0    1    1    0
 [8,]    0    0    0    0
 [9,]    0    0    0    0
[10,]    0    0    1    0
[11,]    1    1    0    0
[12,]    0    0    0    1
[13,]    1    0    0    0
[14,]    0    1    0    1
[15,]    0    0    0    0
[16,]    0    0    1    0
[17,]    0    1    0    1
[18,]    0    0    0    0
[19,]    0    0    0    0
[20,]    1    0    1    0



The code assumes that cases are in columns and observations in rows of
data. Examine data and trafoData to see how the transformation works.
Once you have the transformed data, simply apply your favorite
clustering method that uses Euclidean distance.
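
For instance, something along the following lines should work. Take it
as a sketch rather than a recommendation: the example data are typed in
literally (sample() may return different values in other R versions),
and average linkage and the cut into 2 clusters are arbitrary choices
of mine.

```r
# Example data from above, entered literally (cases in columns,
# observations in rows; matrix() fills column by column)
data = matrix(c(3, 1, 3, 1, 4,
                3, 3, 3, 2, 1,
                1, 3, 2, 4, 4,
                2, 2, 4, 2, 1), nrow = 5, ncol = 4)
maxVal = 4

# Same transformation as above, written inline so the snippet stands alone
indicator = diag(maxVal)
trafoData = apply(data, 2, function(obs) as.vector(indicator[, obs]))

# dist() measures distances between rows, so transpose to put cases in rows
d = dist(t(trafoData), method = "euclidean")

# Squared distance divided by 2 = number of observations in which
# two cases differ
nDiff = as.matrix(d)^2 / 2

# Hierarchical clustering on the distances, cut into 2 groups
tree = hclust(d, method = "average")
clusters = cutree(tree, k = 2)

print(round(nDiff))
print(clusters)
```

With many cases the full distance matrix can get large; kmeans() on
t(trafoData) also works with Euclidean distance and does not need the
pairwise distances stored.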

HTH,

Peter



