[R] memory issue trying to solve too large a problem using hclust

Wed Dec 5 16:39:24 CET 2001

Hello,

I am wondering whether the "dna library 0.2" that Professor Jim Lindsey made
available in March 2000 (www.luc.ac.be/~jlindsey/rcode.html), as well as his
papers ("An Introduction to Markov Models in Molecular Biology" (2000),
available at http://alpha.luc.ac.be/~lucp0753/manuscripts.html), might be
useful for your purposes.

Thanks,
Carlos Ortega.

-----Original Message-----
From: owner-r-help at stat.math.ethz.ch
[mailto:owner-r-help at stat.math.ethz.ch] On behalf of Aboubakar Maitournam
Sent: Wednesday, 5 December 2001 11:15
To: cstrato at EUnet.at; r-help at stat.math.ethz.ch
Subject: Re: [R] memory issue trying to solve too large a problem using
hclust

"cstrato at EUnet.at" wrote:

> Dear Aboubakar
>
> Thank you for your reply. I know that clustering is not a trivial
> issue, which is why I thought I could start a discussion.
> It may not seem to belong on r-help, but since many people (including
> me) use R/S for expression profiling, I thought I would try it anyhow.
>
> Since you mention a distance based on correlation, this was my
> question #2: Is it possible for R/S to support it in hclust as well?
> Since I use S/R as my main packages, it is a severe limitation to
> have a limited choice of metrics.
>
> You mention that k-means can have many solutions, but as far as
> I know, the results of agglomerative hierarchical clustering
> depend on the order of the data? For this reason, one company
> (Applied Maths) even calculates the significance of the branches
> of a tree using bootstrap techniques. Could this possibly also be
> done with R/S?
>
> Furthermore, if I remember correctly, someone has mentioned that
> divisive hierarchical clustering would be preferable to agglomerative
> clustering, but that no algorithms exist to compute it in a reasonable
> time. (Could it be that this was mentioned by Prof. Ripley?)
>
> Quite some time ago I tried the different cluster algorithms
> and metrics available in S/R, and at that time DIANA seemed to give
> the best results. I think it is a pity that more recent cluster
> algorithms such as CURE etc. (see question #4) are not implemented, so
> that it is not possible to try them and compare them with the currently
> used ones.
>
> (BTW, mclust seems to give especially bad results, but I do not
> know why.)
>
> Personally, I would prefer to have a function that clusters the
> data using a couple of different cluster algorithms and then identifies
> those branches of a tree that always turn up in the same sub-cluster,
> which could then be considered "stable".
>
> Best regards
> Christian Stratowa
>
> Aboubakar Maitournam wrote:
>
> >
> > I am not a famous statistician (so I will tread carefully), but I know
> > that the clustering problem is not a trivial task and is not completely
> > solved. The technique most used for clustering gene expression data is
> > hierarchical clustering, which depends on the choice of distance.
> > There is some consensus about the distance based on correlation (take
> > care, because sometimes it is not a distance in the strict topological
> > sense, i.e. in the sense of a metric space). In addition, hierarchical
> > clustering is sensitive to noise. But owing to phylogenetic practice
> > and the pioneering work of Eisen, hierarchical clustering is the most
> > widely used technique in gene expression data analysis (for
> > clustering).
> > k-means, like hierarchical clustering, involves arbitrary choices and
> > can give many solutions.
> > The methods that are under theoretical development, which give the
> > number of clusters in the data and determine the corresponding classes,
> > are based on mixture models, such as the package mclust, or on some
> > published work based on simulated annealing.
> > But naturally it is difficult to change "les habitudes" (the usual
> > practices), and perhaps the stochastic background on which these
> > methods are based, which is not poetic, explains why they are not used.
> > Finally, if you want to use the classical methods (PCA, k-means,
> > hierarchical clustering), the best approach is to try at least two
> > methods.
> > Note that there are also non-classical methods based on graph theory or
> > neural networks, but the objective methods remain PCA and stochastic
> > methods.
> >
> > Aboubakar Maitournam.
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Indeed, there are many clustering problems in the analysis of expression
profiles that are not solved by R. However, some packages dedicated to that
task, such as GeneSom, are beginning to be released for R. Commercial
software solves some clustering problems, but not all. The mclust package,
related to a paper published in Bioinformatics, has some restrictions
(number of classes, ...).

Aboubakar Maitournam.
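
An illustrative aside on the correlation-based distance discussed earlier in
this thread: the sketch below (Python, standard library only) uses three
made-up profiles to show that 1 - r, with r the Pearson correlation, can
violate the triangle inequality, so it is not a metric in the strict sense.
For the same profiles, sqrt(2 * (1 - r)) does satisfy it, because it equals
the Euclidean distance between the centred, unit-length profiles. The
profiles and the function names are assumptions chosen for illustration and
do not come from the original posts.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corr_dissimilarity(x, y):
    """1 - r: widely used for expression profiles, but not a true metric."""
    return 1.0 - pearson(x, y)


def corr_metric(x, y):
    """sqrt(2 * (1 - r)): Euclidean distance between standardized profiles."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - pearson(x, y))))


# Three hypothetical expression profiles measured over three conditions.
a = [1.0, -1.0, 0.0]
b = [2.0, 0.0, -2.0]
c = [1.0, 1.0, -2.0]

# Triangle inequality check: is d(a, c) <= d(a, b) + d(b, c)?
direct = corr_dissimilarity(a, c)
detour = corr_dissimilarity(a, b) + corr_dissimilarity(b, c)
violates_triangle = direct > detour

metric_direct = corr_metric(a, c)
metric_detour = corr_metric(a, b) + corr_metric(b, c)
metric_ok = metric_direct <= metric_detour

print("1 - r violates the triangle inequality:", violates_triangle)
print("sqrt(2 * (1 - r)) satisfies it here:", metric_ok)
```

In R itself, a correlation-based dissimilarity can be handed to hclust by
converting it with as.dist(), for example as.dist(1 - cor(t(x))) for a
matrix x with one profile per row; whether that choice suits a given data
set is a separate question.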
