[R] Cluster analysis, factor variables, large data set

Hans Ekbrand hans at sociologi.cjb.net
Thu Mar 31 19:46:27 CEST 2011


Dear R helpers,

I have a large data set with 36 variables and about 50,000 cases. The
variables represent labour market status in each of 36 months, and each
takes one of 8 different values (e.g. Full-time Employment, Student, ...).

Only cases with at least one change in labour market status are
included in the data set.

To analyse subsets of the data, I have used daisy in the cluster
package to create a distance matrix and then pam (or pamk in the fpc
package) to get a k-medoids cluster solution. Now I want to analyse
the whole set.
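
For reference, a minimal sketch of that subset workflow. The data here
are simulated, and the status labels, the subset size of 500, the
column names m1..m36 and the choice of k = 4 are all assumptions made
for illustration, not values from the real data set:

```r
library(cluster)

set.seed(1)

# Hypothetical status labels standing in for the 8 real categories
status <- c("FullTime", "PartTime", "Student", "Unemployed",
            "ParentalLeave", "SickLeave", "Retired", "Other")

n <- 500  # size of the subset (assumed)

# 36 monthly factor variables, one column per month
cols <- lapply(1:36, function(i) {
  factor(sample(status, n, replace = TRUE), levels = status)
})
sub <- as.data.frame(setNames(cols, paste0("m", 1:36)))

# Gower dissimilarity, which daisy uses for non-numeric columns
d <- daisy(sub, metric = "gower")

# k-medoids on the precomputed dissimilarities (k = 4 is arbitrary here)
fit <- pam(d, k = 4, diss = TRUE)

table(fit$clustering)
```

This works fine for a few thousand cases, since the full dissimilarity
object still fits in memory.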

clara is said to cope with large data sets, but the first step of the
cluster analysis, the creation of the distance matrix, must be done by
another function, since clara only works with numeric data.

Is there an alternative to the daisy -> clara route that does not
require as much RAM?
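
To put the RAM problem in numbers, here is a rough estimate. It
assumes the dissimilarities are held as 8-byte doubles for the lower
triangle only, and it ignores any temporary copies made along the way:

```r
n <- 50000                     # approximate number of cases

n_pairs <- n * (n - 1) / 2     # entries in the lower triangle
bytes   <- n_pairs * 8         # 8 bytes per double
gb      <- bytes / 1e9         # size in gigabytes

gb
```

That comes to roughly 10 GB for the dissimilarity object alone.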

What functions would you recommend for a cluster analysis of this kind
of data on a large data set?


regards,

Hans Ekbrand


