[R] memory issue trying to solve too large a problem using hclust

Fri Nov 30 21:04:29 CET 2001

Hi all, hi Matthew

I would like to extend this question and take the opportunity
to ask all the famous statisticians in this group for advice.

First a personal comment :-)
I am quite amused at how easy it sometimes is to tell which
project someone writing to this group is working on:
You mention that you want to cluster 12,500 objects. If I am
correct, you are trying to cluster the 12,500 genes on the
human Affymetrix GeneChip HgU95A, correct?
(At least this is what I am trying to do right now.)

Now to the questions that I have wanted to ask for quite some time:

Since the paper:
Eisen MB, Spellman PT, Brown PO, Botstein D.
Cluster analysis and display of genome-wide expression patterns.
Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8.
most biologists working on gene expression have used hierarchical
clustering to cluster all the genes they have on their DNA chips.
Next year we will see chips carrying more than 20,000 genes
each.

Thus the question is:
1, What is the best way to cluster this many genes?
I have sometimes heard that you should first use k-means to
divide the genes into a few subclusters, and then use hierarchical
clustering within each subcluster only. Is this correct?

2, When you do hierarchical clustering, which metric would
be best to use?
M. Eisen's paper uses Pearson correlation as the metric.
Is there a way to implement this metric for use in hclust?
Unfortunately, hclust supports only Euclidean and Manhattan distances.

3, R/S contain some other clustering algorithms such as CLARA,
PAM, FANNY, AGNES. However, I have never seen any paper on
expression profiling using these algorithms. Is there a particular
reason why these functions are not used?

4, Meanwhile, new methods for cluster analysis have been
developed. For example, the book "Data Mining" by Han & Kamber
mentions BIRCH, CURE, DBSCAN, OPTICS, DENCLUE, STING
as some of these new algorithms.
Would it make sense to use one of these methods?
Does anyone know whether implementations of these algorithms
exist?

5, As I understand it, there is no single "best" clustering
algorithm for this purpose; you have to try different methods
and find out which one describes the data best.
This is often easy when you cluster samples, but it is hard to
judge when clustering 20,000 or even more genes.

6, Are there methods better than clustering for grouping
genes with similar behavior?
PCA may be one, but it is based on dimensionality reduction,
which may not be applicable in many cases.
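
As an illustration of questions 1 and 2, here is a minimal R sketch
under stated assumptions, not a recommendation. It relies on hclust
accepting any "dist" object, so a correlation-based dissimilarity
such as 1 - r can be passed in via as.dist(). The toy matrix expr,
the helper cor_dist, the choice of 4 k-means centers and average
linkage are all hypothetical; in current R these functions are in
the stats package.

```r
# Hypothetical toy data: 200 "genes" (rows) x 8 "samples" (columns).
set.seed(1)
expr <- matrix(rnorm(200 * 8), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:8)))

# Question 2: Pearson correlation as a dissimilarity for hclust.
# cor() correlates columns, so transpose to correlate genes (rows).
cor_dist <- function(m) as.dist(1 - cor(t(m)))  # 0 when r = 1, 2 when r = -1

# Question 1: pre-partition with k-means, then build one tree per partition.
# Rows are standardized first; for such rows the squared Euclidean distance
# equals 2 * (ncol - 1) * (1 - r), so k-means and 1 - r agree on closeness.
z  <- t(scale(t(expr)))
km <- kmeans(z, centers = 4, nstart = 5)        # 4 is an arbitrary choice

members <- split(seq_len(nrow(expr)), km$cluster)
trees <- lapply(members, function(idx) {
  if (length(idx) < 2) return(NULL)             # hclust needs >= 2 objects
  hclust(cor_dist(expr[idx, , drop = FALSE]), method = "average")
})
```

With this layout the largest distance matrix grows with the largest
partition rather than with the full set of genes, which is the
appeal of the two-stage approach for memory-bound problems.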

I know that questions about clustering large data sets have
partly been answered in this group, but I have the feeling that
many of these questions remain open, especially when applied to
expression profiling.

I also know that many people working in this field use R/S
as their main tool, so any help would be appreciated, and not
only by me.

Best regards
Christian Stratowa
----------------------------------
C.h.r.i.s.t.i.a.n  S.t.r.a.t.o.w.a
V.i.e.n.n.a,  A.u.s.t.r.i.a

"Wiener, Matthew" wrote:

> Hi, all.
>
> I'm trying to cluster 12,500 objects using hclust from package mva.  The
> distance matrix takes up nearly 600 MB.  The distance matrix also needs to
> be copied when being passed to the fortran routine that actually does the
> clustering (it's modified during the clustering), so that's 1200 MB.  I'm
> actually on a machine with 2.5 GB of memory (and nothing else running), so I
> thought I could pull this off.  The routine quits with the error "cannot
> allocate a vector of size 609131 KB", which by its size seems to be another
> copy of the distance matrix, I think the one needed by the fortran routine.
> As far as I can tell from looking at the code, no additional objects of the
> size of the distance matrix are used.
>
> After the error gc() says that the garbage collection threshold is 1433 MB.
>
> I'm wondering whether some additional copies of the distance matrix are
> being made, and whether I could somehow stop them from being made.  Any
> other suggestions for how I could get around the memory problem would also
> be appreciated.  (I know of clara in the "cluster" package, but would like
> to use hierarchical methods.)
>
> The function hierclust in multiv seems to demand even more memory, even when
> bign = T.
>
> I am running R-1.3.1 on Sun OS 5.6.
>
> Thanks for any help.
>
> Matthew Wiener
> Applied Computer Science and Mathematics Department
> Merck Research Labs
> Rahway, NJ  07065-0900
> 732-594-5303
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
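
The sizes quoted in the message above can be checked with a little
arithmetic. A "dist" object stores the n(n-1)/2 pairwise distances
as 8-byte doubles. The sketch below (plain R, variable names are
illustrative) reproduces the "nearly 600 MB" figure and backs out
the number of objects implied by the 609131 KB in the error message,
assuming that allocation is exactly one such vector.

```r
n <- 12500
entries <- n * (n - 1) / 2        # pairwise distances in the lower triangle
mb <- entries * 8 / 1024^2        # 8 bytes per double; about 596 MB

# Invert m = n (n - 1) / 2 for the size reported in the error message.
m <- 609131 * 1024 / 8            # number of doubles in 609131 KB
n_implied <- (1 + sqrt(1 + 8 * m)) / 2   # close to 12,488
```

Under that assumption, the failed allocation is consistent with one
more full copy of the distance vector for roughly 12,488 objects.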
