[R] memory issue trying to solve too large a problem using hclust

Aboubakar Maitournam amaitour at pasteur.fr
Wed Dec 5 11:14:50 CET 2001


"cstrato at EUnet.at" wrote:

> Dear Aboubakar
>
> Thank you for your reply. I know that clustering is not a trivial
> issue; this is why I thought I could start a discussion.
> It may not seem to belong on r-help, but since many people (including
> me) use R/S for expression profiling, I thought I would try anyhow.
>
> Since you mention a distance based on correlation, this was my
> question #2: Is it possible for R/S to support it in hclust as well?
> Since I use S/R as my main packages, it is a severe limitation to
> have only a limited choice of metrics.
>
> You mention that k-means can have many solutions, but as far as
> I know, the results of agglomerative hierarchical clustering
> depend on the order of the data? For this reason, one company
> (Applied Maths) even calculates the significance of the branches
> of a tree using bootstrap techniques. Could this also be done
> with R/S?
>
> Furthermore, if I remember correctly, someone has mentioned that
> divisive hierarchical clustering would be preferable to agglomerative
> clustering, but that no algorithms exist to compute it in a reasonable
> time. (Could it be that this was mentioned by Prof. Ripley?)
>
> Quite some time ago I tried the different cluster algorithms
> and metrics available in S/R, and at that time DIANA seemed to give
> the best results. I think it is a pity that more recent cluster
> algorithms such as CURE etc. (see question #4) are not implemented,
> so that it is not possible to try them and compare them with the
> currently used ones.
>
> (BTW, mclust seems to give especially bad results, but I do not
> know why?)
>
> Personally, I would prefer to have a function that clusters the
> data using a couple of different cluster algorithms and then identifies
> those branches in a tree which always turn up in the same sub-cluster;
> these could then be considered "stable".
>
> Best regards
> Christian Stratowa
>
> Aboubakar Maitournam wrote:
>
> >
> > I am not a famous statistician (so I will tread carefully), but I
> > know that the clustering problem is not a trivial task and is not
> > completely solved. The most widely used technique for clustering
> > gene expression data is hierarchical clustering, which depends on
> > the choice of distance. There is some consensus about the distance
> > based on correlation (take care, because sometimes it is not a
> > distance in the strict topological sense, that is, in the sense of
> > a metric space). In addition, hierarchical clustering is sensitive
> > to noise. But owing to phylogenetic practice and the pioneering work
> > of Eisen, hierarchical clustering is the most widely used clustering
> > technique in the analysis of gene expression data.
> > K-means, like hierarchical clustering, involves arbitrary choices
> > and can give many solutions.
> > The methods under theoretical development, which give the number of
> > clusters in the data and determine the corresponding classes, are
> > based on mixture models, as in the package mclust, or on some
> > published work based on simulated annealing.
> > But naturally it is difficult to change "les habitudes" (the usual
> > practices), and perhaps the stochastic background on which these
> > methods are based, which is not poetic, explains why they are not
> > used.
> > Finally, if you want to use the classical methods (PCA, k-means,
> > hierarchical clustering), the best approach is to try at least two
> > of them.
> > Note that there are also non-classical methods based on graph theory
> > or neural networks, but the objective methods remain PCA and
> > stochastic methods.
> >
> > Aboubakar Maitournam.
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Indeed, there are many clustering problems in the analysis of expression
profiles that R does not solve. However, some packages dedicated to that
task, such as GeneSom, are beginning to be released for R. Commercial
software solves some clustering problems, but not all. The mclust package,
which relates to a paper published in Bioinformatics, has some
restrictions (number of classes, ...).
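
On question #2 in the quoted message (a correlation-based distance for
hclust) and the idea of "stable" sub-clusters, here is a minimal sketch,
not a definitive recipe. It assumes that rows are genes and columns are
arrays; the data are simulated, and the choice of 4 groups and of three
linkage methods is arbitrary. Note that 1 - r is a dissimilarity, not a
metric in the strict sense.

```r
# Simulated expression matrix: 20 genes (rows) x 8 arrays (columns).
# These values are made up purely for illustration.
set.seed(1)
x <- matrix(rnorm(20 * 8), nrow = 20,
            dimnames = list(paste0("gene", 1:20), NULL))

# cor() correlates columns, so transpose to correlate genes.
# 1 - r lies in [0, 2]; it is a dissimilarity, not a true metric.
d <- as.dist(1 - cor(t(x)))

# hclust() accepts any "dist" object, not only the output of dist().
hc <- hclust(d, method = "average")

# Cut the tree into k = 4 groups (k is an arbitrary choice here).
groups <- cutree(hc, k = 4)

# Crude stability check: repeat with several linkage methods and keep
# the gene pairs that share a cluster under every one of them.
methods <- c("average", "complete", "single")
cuts <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 4))
same <- Reduce(`&`, lapply(methods, function(m)
  outer(cuts[, m], cuts[, m], "==")))
```

Here same[i, j] is TRUE when genes i and j fall in the same group under
all three linkages. This only compares linkage methods on one distance;
it is not a bootstrap assessment of branch significance.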


Aboubakar Maitournam.






