[R] problems with a large data set
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Apr 25 18:42:12 CEST 2001
On Wed, 25 Apr 2001, Moritz Lennert wrote:
> I have trouble with a data set that comprises 2136 lines of 20 columns.
> I would like to do a hierarchical clustering and I tried the following:
> ages.hclust <- hclust(dist(ages, method="euclidean"), "ward")
> but I get the following error message:
> Error: cannot allocate vector of size 17797 Kb
> When I try to do the dist() alone first without the hclust(), I get the
> same type of message.
> Then I tried with the RPgSQL packages by typing
> Connected to database "space" on "localhost"
> > bind.db.proxy("ages")
> > ages.hclust <- hclust(dist(ages, method="euclidean"), "ward")
That does not help. You need to retrieve the data to use it!
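As a rough sketch of what to check once the table is in an ordinary data frame: dist() wants purely numeric input, so any column that is not numeric is worth looking at. The data frame ages.df and its columns below are made up for illustration, and how the table is actually retrieved depends on the database interface.

```r
# Hypothetical stand-in for the retrieved table; the real 'ages' would come
# from the database through whatever retrieval call the interface provides.
ages.df <- data.frame(a = c(1, 2, 3), b = c("4", "x", "6"),
                      stringsAsFactors = FALSE)

# Names of the columns that are not stored as numbers
non.numeric <- names(ages.df)[!sapply(ages.df, is.numeric)]
non.numeric

# Entries of a character column that do not convert to a number
bad <- is.na(suppressWarnings(as.numeric(ages.df$b)))
ages.df$b[bad]
```

Entries flagged this way are the ones that would turn into NA when coerced to numeric.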
> This time I get:
> Error in dist(ages, method = "euclidean") :
> NA/NaN/Inf in foreign function call (arg 1)
> In addition: Warning message:
> NAs introduced by coercion
> I've checked, and I can't find any missing values or something similar.
> Could someone tell me if I'm doing something wrong, or whether this is
> just too much data for R?
This may be too much data for your computer, but not for R: I've
just done this in a few seconds. I suggest that you need more memory
(real or virtual): on my simulation it used about 80Mb.
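For what it is worth, the size in the error message can be reproduced by hand. This is only a back-of-the-envelope sketch: it assumes distances are stored as 8-byte doubles, one per pair of rows, and that one of the 2136 lines is a header, leaving 2135 data rows.

```r
n <- 2135                    # assumed number of data rows (2136 lines less a header)
n.dist <- n * (n - 1) / 2    # one distance per pair of rows
kb <- n.dist * 8 / 1024      # 8 bytes per double, expressed in Kb
floor(kb)                    # 17797
```

So one vector of pairwise distances is already about 17 Mb, and clustering needs more working space than that single vector.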
I should say that doing agglomerative hierarchical clustering on thousands of
points makes little sense: it is not a good way to find large clusters.
Try a partitioning method like kmeans or clara (in package cluster).
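A minimal sketch of the two partitioning methods, run on simulated data of the same shape as the data set described (the matrix x, the choice k = 4 and the settings nstart and samples are arbitrary assumptions for illustration, not recommendations):

```r
library(cluster)   # provides clara()

set.seed(1)
# Simulated stand-in for the real data: 2136 rows, 20 numeric columns
x <- matrix(rnorm(2136 * 20), nrow = 2136, ncol = 20)

k <- 4   # number of clusters, chosen arbitrarily here

# k-means with several random starts
km <- kmeans(x, centers = k, nstart = 10)
table(km$cluster)

# clara: partitioning around medoids fitted on subsamples of the rows
cl <- clara(x, k = k, samples = 20)
table(cl$clustering)
```

Neither call forms the full matrix of pairwise distances between all rows, so memory use stays far below that of dist() followed by hclust().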
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch