[R] fitting mixture of gaussians using emclust() of mclust package

Jonathan Qiang Li jonqli at labs.agilent.com
Wed Aug 1 17:34:30 CEST 2001


Thanks for the help.

Rather than using emclust(), calling me() directly with kmeans-induced
initial starting parameters seems to work better (though I am not sure
by how much, since I have to subsample the data pretty aggressively to
get results at all).
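For reference, here is a minimal sketch of what I mean. It assumes an
mclust interface in which me(), mstep() and estep() take (data,
modelName, ...) arguments and unmap() turns class labels into an
indicator matrix; argument names and order may differ between
versions, so check your own installation:

  library(mclust)

  ## hypothetical example data standing in for the real set
  set.seed(1)
  x <- matrix(rnorm(20000 * 5), ncol = 5)
  G <- 4                          # number of mixture components

  ## k-means on a random subsample gives a cheap initial partition
  sub <- sample(nrow(x), 1000)
  km  <- kmeans(x[sub, ], centers = G)

  ## one M-step turns the hard labels into mixture parameters
  ms <- mstep(data = x[sub, ], modelName = "VVV",
              z = unmap(km$cluster))

  ## an E-step on the full data converts the parameters into a z matrix
  es <- estep(data = x, modelName = "VVV", parameters = ms$parameters)

  ## me() then iterates EM to convergence on all observations
  fit <- me(data = x, modelName = "VVV", z = es$z)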

But I still find that when I have more than 10,000 observations, the
routine takes a painfully long time to converge. I understand that the
speed of convergence of the EM algorithm is data-dependent and in
general very slow. But from their experience, do people have a
benchmark estimate for the relationship between sample size and
computation time in R? Can someone also point out some references or
packages for speeding up EM, especially when the sample size and
dimension are not trivial? (Not exactly an R-related question, but I
thought people on this list would be interested in such problems.)
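For concreteness, the kind of scaling measurement I have in mind is
something like the following, reusing the names from the sketch above
(timings are of course machine-dependent):

  ## time a full me() run on subsamples of increasing size
  for (n in c(1000, 2000, 5000, 10000)) {
    i  <- sample(nrow(x), n)
    km <- kmeans(x[i, ], centers = G)
    tm <- system.time(me(data = x[i, ], modelName = "VVV",
                         z = unmap(km$cluster)))
    cat(n, "obs:", tm["elapsed"], "sec elapsed\n")
  }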

Regards,
Jonathan

Christian Hennig wrote:
> 
> On Tue, 31 Jul 2001, Jonathan Qiang Li wrote:
> 
> > Hi,
> >
> > Has anyone tried to use the mclust package function emclust() to
> > fit a mixture-of-Gaussians model to a relatively large dataset? By
> > "large", I specifically have in mind a dataset with 50,000
> > observations and 23 dimensions. My machine has 750MB of memory and
> > 500MB of swap space. When I try to use emclust() on the dataset, I
> > consistently get messages such as "Error: cannot allocate vector
> > of size 1991669 Kb". In other words, does this mean that R is
> > trying to allocate almost 2000MB of space? Should this be
> > considered abnormal?
> 
> No. I recently talked to A. E. Raftery, one of the designers of the
> original S-PLUS version, and he said that there are indeed problems
> with datasets of more than, say, 10,000 observations. He said that
> it is the number of observations that matters, not the dimension.
> According to him, the main problem is the hierarchical routine that
> produces the initial partition. He suggests taking a random
> subsample of size 100-1000 and generating the initial starting
> parameters from that subsample. I cannot tell you the details,
> because I have not tried it myself so far. But the principle is that
> you can tell emclust/mclust how the starting values are generated,
> so that the default, memory-intensive hierarchical clustering, is
> replaced by a fixed starting configuration obtained from a
> subsample.
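
If it helps anyone else reading this: here is how such a subsample
initialization might look, assuming an mclust interface in which
mclustBIC()/Mclust() accept an `initialization' argument (the
emclust() interface discussed above may well differ):

  ## the hierarchical initialization only ever sees the subsample
  s   <- sample(nrow(x), 1000)
  bic <- mclustBIC(x, G = 2:6, initialization = list(subset = s))
  fit <- Mclust(x, x = bic)   # refit the best model on the full data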
> 
> Another hint: for high dimensions it is not advisable to fit the
> "VVV" model, because of the high probability of spurious local
> maxima of the likelihood.
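
With the same assumed interface, the model class can also be
restricted to more parsimonious covariance structures, e.g.:

  bic <- mclustBIC(x, G = 2:6, modelNames = c("EII", "VII", "EEE"),
                   initialization = list(subset = s))

so that the fully unconstrained "VVV" covariances are never fitted.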
> 
> Hope that helps,
> Christian
> 
> ***********************************************************************
> Christian Hennig
> University of Hamburg, Faculty of Mathematics - SPST/ZMS
>  (Schwerpunkt Mathematische Statistik und Stochastische Prozesse,
>  Zentrum fuer Modellierung und Simulation)
> Bundesstrasse 55, D-20146 Hamburg, Germany
> Tel: x40/42838 4907, private x40/631 62 79
> hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
> #######################################################################
> I recommend www.boag.de

-- 
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._


