[R] hierarchical clustering of large dataset

Peter Langfelder peter.langfelder at gmail.com
Fri Mar 9 21:54:34 CET 2012


On Thu, Mar 8, 2012 at 4:41 AM, Massimo Di Stefano
<massimodisasha at gmail.com> wrote:
>
> Hello All,
>
> I have a set of observations in the form:
>
> a,    b,    c,    d,    e,    f
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 66.57,    4.28,    1.355,    30,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 63.64,    9.726,    1.3004,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> ….
>
> 55,000 observations in total.

Hi Massimo,

First, you don't want to use the entire matrix to calculate the
distance. Select the environmental columns, and consider standardizing
them so that no single column has more influence on the distance than
the others.
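
A minimal sketch of that first step, using a few rows copied from the
question. Which columns count as "environmental" is not stated in the
thread, so the choice of a, b and c below is purely an assumption for
illustration, as are the names obs, env_cols and env_scaled.

```r
# A few rows copied from the question; the full data has about 55,000 rows.
obs <- data.frame(
  a = c(67.12, 66.57, 66.20, 63.64, 63.28),
  b = c(4.28, 4.28, 4.28, 9.726, 9.725),
  c = c(1.7825, 1.355, 1.3459, 1.3004, 1.2755),
  d = c(30, 30, 13, 6, 6),
  e = c(3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 11012, 11012)
)

# ASSUMPTION: a, b and c are the environmental columns (not stated in the post).
env_cols <- c("a", "b", "c")

# scale() centers each column to mean 0 and rescales it to standard deviation 1,
# so no column dominates the Euclidean distance just because of its units.
env_scaled <- scale(obs[, env_cols])

# Pairwise Euclidean distances between the standardized observations.
d <- dist(env_scaled)
```

Note that a constant column (such as e in these sample rows) has zero
standard deviation, so scale() would turn it into NaN; drop such columns
before standardizing.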

Second, if you want to cluster such a huge data set using hierarchical
clustering, you need a lot of memory, at least 32GB but preferably
64GB, because the distance matrix grows with the square of the number
of observations. If you don't have that much, you cannot use
hierarchical clustering.
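
As a rough back-of-the-envelope check of the memory requirement, the
sketch below only counts the distance object itself, assuming it is
stored the way dist() stores it (one 8-byte double per pair of
observations). Any working copies made during clustering come on top of
that, which is why the practical requirement is several times larger.

```r
n <- 55000                    # number of observations from the question

# dist() keeps only the lower triangle: one value per unordered pair.
n_pairs <- n * (n - 1) / 2

# Each distance is a double-precision number (8 bytes).
bytes <- n_pairs * 8

gb <- bytes / 1e9
gb                            # about 12.1 GB for the distances alone
```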

Third, if you do have enough memory, use package flashClust or
fastcluster (I am the maintainer of flashClust). You can install
flashClust with install.packages("flashClust") and load it with
library(flashClust). The standard R implementation of hclust is
unnecessarily slow (order n^3); flashClust provides a replacement
(function hclust) that runs in approximately n^2 time. I have clustered
data sets of 30000 variables in a minute or two, so 55000 shouldn't
take more than 4-5 minutes, again assuming your computer has enough
memory.
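
A small sketch of how the replacement might be used. The data here are
synthetic stand-ins (200 random points), and the linkage method
("average") and the number of groups (k = 4) are arbitrary choices for
illustration, not recommendations from this thread. The sketch falls
back to base R's hclust when flashClust is not installed, so it runs
either way.

```r
set.seed(1)

# Synthetic stand-in for the standardized environmental columns.
x <- scale(matrix(rnorm(200 * 3), ncol = 3))
d <- dist(x)

# Use flashClust's hclust if the package is available, otherwise base R's.
hclust_fn <- if (requireNamespace("flashClust", quietly = TRUE)) {
  flashClust::hclust
} else {
  stats::hclust
}

# Both functions take a dist object and return an object of class "hclust".
tree <- hclust_fn(d, method = "average")

# Cut the dendrogram into a fixed number of groups (k chosen arbitrarily here).
groups <- cutree(tree, k = 4)
table(groups)
```

Because the result has class "hclust", the usual tools such as plot()
and cutree() work on it unchanged.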

HTH,

Peter


