[R] Massive clustering job?

Christian Hennig fm3a004 at math.uni-hamburg.de
Wed Dec 15 13:16:35 CET 2004


Dear Dan,

I would think about transforming your columns (square root, log?) so
that methods which operate on the n*p data matrix and assume roughly
elliptical within-cluster distributions can be applied: kmeans or clara
directly, or, after dimension reduction, EMclust or fixmahal.
Maybe you can even do that on untransformed data (take a look at the
variable-wise distributions or 2-d scatterplots first).
You do not need a distance matrix then.
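
A rough sketch of what I mean, on simulated data rather than yours: the
Poisson counts, the two artificial groups, the choice of 2 clusters and
all object names are assumptions made up for illustration. kmeans is in
base R (stats) and clara is in the recommended package cluster; both
take the 40,000 x 20 matrix itself, so the n x n distance matrix is
never computed.

```r
# Sketch only: simulated counts stand in for the real table
# (40,000 rows x 20 count columns). All names here are hypothetical.
set.seed(1)
n <- 40000
p <- 20

# Two artificial groups with different mean counts (an assumption).
grp <- rep(1:2, each = n / 2)
counts <- matrix(rpois(n * p, lambda = rep(c(3, 12)[grp], times = p)),
                 nrow = n, ncol = p)

# Square root transformation; log1p(counts), i.e. log(1 + count),
# would be an alternative that copes with zero counts.
x <- sqrt(counts)

# Look at one variable-wise distribution and one 2-d scatterplot
# (the scatterplot uses a random subsample to keep it readable).
idx <- sample(n, 2000)
hist(x[, 1])
plot(x[idx, 1], x[idx, 2])

# k-means works directly on the n x p matrix.
km <- kmeans(x, centers = 2, nstart = 5, iter.max = 50)
print(table(km$cluster))

# clara clusters subsamples and then assigns every row to a medoid.
library(cluster)
cl <- clara(x, k = 2, samples = 20)
print(table(cl$clustering))
```

The number of clusters has to be chosen by you; 2 is used here only
because the simulated data were generated with two groups.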

Christian

On Wed, 15 Dec 2004, Dan Bolser wrote:

> 
> Hi, 
> 
> I have ~40,000 rows in a database, each of which contains an id column and
> 20 additional columns of count data.
> 
> I want to cluster the rows based on these count vectors.
> 
> There are ~1.6 billion possible 'distances' between pairs of vectors
> (cells in my distance matrix), so I need to do something smart.
> 
> Can R somehow handle this?
> 
> My first thought was to index the database with something that makes
> nearest neighbour lookup more efficient, and then use single linkage
> clustering. Is this kind of index implemented in R (by default when using
> single linkage)?
> 
> Also 'grouping' identical vectors is very easy. I tried making groups more
> fuzzy by using a hashing function over the count vectors, but my hash was
> too crude. Any way to do fuzzy grouping in R which scales well?
> 
> For example, removing identical vectors gives me ~30,000 rows (and ~900
> million pairs of distances). As an example of how fast I can group, the
> above query took 0.13 seconds in mysql (using an index over every element
> in the vector). However, if I tried to calculate a distance between every
> pair of non-identical vectors (let's say I can calculate ~1000 Euclidean
> distances per second) it would take me ~10 days just to calculate the
> distance matrix.
> 
> Sorry for all the information. Any suggestions on how to cluster such a
> huge dataset (using R) would be appreciated.
> 
> Cheers,
> Dan.
> 

***********************************************************************
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag-online.de



