[R] fitting mixture of gaussians using emclust() of mclust package

Murray Jorgensen maj at waikato.ac.nz
Thu Aug 2 06:22:56 CEST 2001


Jonathan,

I have no direct experience with mclust, but I have used Multimix, a similar
program whose design I was involved in.

I believe that one way to speed up EM for mixtures is to allocate all
observations to clusters according to an unconverged set of parameters, and
then to re-start the algorithm using this allocation as the initial
clustering.
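
In R, using me() as Jonathan describes below, that might look roughly like
this. I have not run it, and the calls (me(), emControl(), map(), unmap())
and their argument names are my guesses at the mclust interface, so treat it
as a sketch rather than working code:

library(mclust)

## Stage 1: run only a few EM iterations, started from a crude k-means partition
z0   <- unmap(kmeans(X, centers = 5)$cluster)   # X: numeric data matrix; 5 groups is a placeholder
fit1 <- me(data = X, modelName = "VVV", z = z0,
           control = emControl(itmax = 10))     # stop well short of convergence

## Stage 2: hard-allocate each observation to its most probable cluster
## and re-start EM from that clustering
z1   <- unmap(map(fit1$z))
fit2 <- me(data = X, modelName = "VVV", z = z1)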

Another way might be to slip in a stochastic EM step, where observations are
allocated probabilistically to clusters using the current values of the
observation-specific cluster membership probabilities.
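
Again only a sketch, continuing from the objects in the previous snippet; the
sampling step is ordinary R, while the me() and unmap() calls are the same
guesses at the mclust interface as above:

## Stochastic E-step: draw a hard label for each observation from its current
## cluster membership probabilities, then re-start EM from that draw
z      <- fit1$z                                # n x G matrix of membership probabilities
labels <- apply(z, 1, function(p) sample(length(p), 1, prob = p))
fit3   <- me(data = X, modelName = "VVV", z = unmap(labels))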

If clustering is your main purpose in fitting the mixture model then complete
convergence may not be all that important, as the clusters tend to stabilize
quite a while before the parameter estimates do.

Multimix is written in Fortran 77 and copes with categorical variables as
well as Gaussian ones. You may download it from my home page along with some
documentation. I would appreciate feedback on how well it copes with your data
if you use it. You will need to adjust some array bounds and recompile to tune
it to your data set.

Murray Jorgensen

At 08:34 AM 1-08-01 -0700, you wrote:
>Thanks for the help.
>
>Rather than using emclust(), using me() directly with kmeans-induced
>starting parameters seems to work better (not sure by how much, since to get
>results I have to sample the data pretty aggressively).
>
>But I still found that when I have data with more than 10,000 observations,
>the routine takes a painfully long time to converge. I understand that the
>speed of convergence of the EM algorithm is data-dependent and in general
>very slow. But do people have, from experience, a benchmark estimate of the
>relationship between sample size and computation time in R? Can someone also
>point out some references/packages for speeding up EM, especially when the
>sample size and dimension are not trivial? (Not exactly an R-related
>question, but I thought people on this list would be interested in such
>problems.)
>
>Regards,
>Jonathan
>
>Christian Hennig wrote:
>> 
>> On Tue, 31 Jul 2001, Jonathan Qiang Li wrote:
>> 
>
>
>> > Hi,
>> >
>> > Has anyone tried to use the mclust package function emclust() to fit a
>> > mixture-of-Gaussians model to a relatively large dataset?
>> > By "large", I specifically have in mind a data set with 50,000
>> > observations and 23 dimensions. My machine has 750M of memory and 500M of
>> > swap space. When I tried to use emclust on the dataset, I consistently got
>> > messages such as "Error: cannot allocate vector of size 1991669 Kb". In
>> > other words, does this mean that R is trying to allocate almost 2000Mb of
>> > space? Should this be considered abnormal?
>> 
>> No. I recently talked to A.E. Raftery, one of the designers of the original
>> Splus version, and he said that there are indeed problems with datasets of
>> more than, say, 10000 observations. He said that it is the number of
>> observations that matters, not the dimension. According to him, the main
>> problem is the hierarchical routine which produces the initial partition. He
>> suggests taking a random subsample of size 100-1000 and generating initial
>> starting parameters from the subsample. I cannot tell you the details,
>> because I have not tried this myself yet. But the principle is that you can
>> tell emclust/mclust how the starting values are generated, so that the
>> default, the memory-intensive hierarchical clustering, is replaced by a
>> fixed starting configuration obtained from a subsample.
>> 
>> Another hint: for high dimensions it is not advisable to fit the
>> "VVV" model, because of the high probability of spurious local maxima of
>> the likelihood.
>> 
>> Hope that helps,
>> Christian
>> 
>> ***********************************************************************
>> Christian Hennig
>> University of Hamburg, Faculty of Mathematics - SPST/ZMS
>>  (Schwerpunkt Mathematische Statistik und Stochastische Prozesse,
>>  Zentrum fuer Modellierung und Simulation)
>> Bundesstrasse 55, D-20146 Hamburg, Germany
>> Tel: x40/42838 4907, privat x40/631 62 79
>> hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
>> #######################################################################
>> I recommend www.boag.de
>
>-- 
>Jonathan Q. Li, PhD
>Agilent Technologies Laboratory
>Palo Alto, California, USA
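
PS. To make the subsample idea from Christian's message concrete, the general
shape in R might be as follows. Again I have not tried it: hc(), hclass(),
mstep(), estep() and me(), together with their argument names, are my guesses
at the mclust interface, and the subsample size of 1000 and G = 5 groups are
arbitrary placeholders:

library(mclust)

## Hierarchical initialisation on a random subsample only
n    <- nrow(X)                                 # X: the full data matrix (e.g. 50,000 x 23)
sub  <- X[sample(n, 1000), ]
tree <- hc(data = sub, modelName = "VVV")       # model-based agglomerative clustering
cl   <- hclass(tree, G = 5)[, 1]                # cut the tree into G clusters

## Turn the subsample partition into parameters, then E-step on the full data
par  <- mstep(data = sub, modelName = "VVV", z = unmap(cl))$parameters
z0   <- estep(data = X, modelName = "VVV", parameters = par)$z

## Run EM on the full data from this fixed starting configuration
fit  <- me(data = X, modelName = "VVV", z = z0)

A kmeans() fit on the same subsample could replace the hc()/hclass() step if
even the subsampled hierarchical stage proves too slow.
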
Dr Murray Jorgensen       http://www.stats.waikato.ac.nz/Staff/maj.html 
Department of Statistics, University of Waikato, Hamilton, New Zealand 
*Applications Editor, Australian and New Zealand Journal of Statistics* 
maj at waikato.ac.nz Phone +64-7 838 4773 home phone 856 6705 Fax 838 4155



