[R] problems with a large data set

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Apr 25 18:42:12 CEST 2001

On Wed, 25 Apr 2001, Moritz Lennert wrote:

> Hello,
> I have trouble with a data set that comprises 2136 lines of 20 columns.
> I would like to do a hierarchical clustering and I tried the following:
> ages.hclust <- hclust(dist(ages, method="euclidean"), "ward")
> but I get the following error message:
> Error: cannot allocate vector of size 17797 Kb
> When I try to do the dist() alone first without the hclust(), I get the
> same type of message.
> Then I tried with the RPgSQL packages by typing
> >db.connect(dbname="space")
> Connected to database "space" on "localhost"
> > bind.db.proxy("ages")
> > ages.hclust <- hclust(dist(ages, method="euclidean"), "ward")

That does not help: binding a proxy leaves the data in the database.
You need to retrieve the data into R to use it!

> This time I get:
> Error in dist(ages, method = "euclidean") :
>         NA/NaN/Inf in foreign function call (arg 1)
> In addition: Warning message:
> NAs introduced by coercion
> I've checked, and I can't find any missing values or anything similar.
> Could someone tell me if I'm doing something wrong, or whether this is
> just too much data for R?

This may be too much data for your computer, but not for R: I've
just done this in a few seconds.  I suggest that you need more memory
(real or virtual): on my simulation it used about 80Mb.
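
As a rough check on where the memory goes, the size in the error message
can be reproduced by hand. This is only a sketch: it assumes dist() stores
one 8-byte double per pair of rows (the lower triangle), and that one of
the 2136 lines is a header, so that 2135 rows are actually clustered. The
second assumption is a guess made because it matches the reported size.

```r
# dist() returns the lower triangle of the distance matrix:
# n * (n - 1) / 2 values for n rows.
n <- 2135                      # assumed: 2136 lines minus one header line
n_pairs <- n * (n - 1) / 2     # number of pairwise distances
size_kb <- 8 * n_pairs / 1024  # 8 bytes per double, expressed in Kb
floor(size_kb)                 # 17797, the size in the error message

# Sanity check of the pair count on a tiny example: 10 rows give 45 distances.
small <- dist(matrix(0, nrow = 10, ncol = 2))
length(small)
```

So the failed allocation is about the size of a single copy of the
distances. Any working copies made by dist() and hclust() come on top of
that, which is presumably why the total is several times larger.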

I should say that doing agglomerative hierarchical clustering on thousands
of points makes little sense: it is not a good way to find large clusters.
Try a partitioning method like kmeans or clara (in package cluster).
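
A minimal sketch of the two partitioning methods, for illustration only.
The data here are simulated stand-ins with the same shape as the data set
in the question, and the choice of 4 clusters is arbitrary. Neither comes
from the original data.

```r
library(cluster)  # provides clara()

set.seed(1)
# Simulated stand-in: 2136 rows, 20 numeric columns (not the real data).
ages <- matrix(rnorm(2136 * 20), ncol = 20)

k <- 4  # arbitrary number of clusters, chosen only for this example

# k-means: works on the data matrix directly, no distance matrix needed.
km <- kmeans(ages, centers = k, nstart = 5)

# clara: partitions around medoids using subsamples, so it never builds
# the full set of pairwise distances.
cl <- clara(ages, k = k)

table(km$cluster)     # cluster sizes from kmeans
table(cl$clustering)  # cluster sizes from clara
```

Both avoid storing all n * (n - 1) / 2 pairwise distances, so memory use
grows roughly with the number of rows rather than with its square.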

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

