[R] separate mixture of gamma and normal (with mixtools or ??)

David Winsemius dwinsemius at comcast.net
Thu Jan 26 21:26:57 CET 2017


> On Jan 23, 2017, at 11:47 PM, Thomas Petzoldt <thpe at simecol.de> wrote:
> 
> Dear Bert,
> 
> thank you very much for your suggestion. You are right, ill-conditioning was sometimes a problem for 3 components, but not in the two-component case. The modes are well separated, and the sample size is high.
> 
> My main problem is (1) the shape of the distributions and (2) the diversity of available packages and approaches to this topic.
> 
> In the mean time I made some progress in this matter by treating the data as a mixture of gamma distributions (package mixdist, see below), so what I want to know is the purely R technical question ;-)
> 
> Has someone else has ever stumbled across something like this and can make a suggestion which package to use?

In survival analysis of cancer cases, the question of cure comes up often. Physicians sometimes have a naive notion of survival to 5 years after definitive treatment with no evident recurrence being equivalent to 'cure', despite the fact that there is great heterogeneity in the recurrence and survival distribution of different cancer types. I have see papers in the medical statistical literature that used mixtures of Weibull variates to model this problem. The cancer-specific survival is often exponential (mu=1) or "sub-exponential" (shape < 1) whereas non-cancer survival times are "super-exponential" (shape >> 1). When I ran your second simulation with dist='weibull' I get:

> library("mixdist")
> set.seed(123)
> lambda <- c(0.25, 0.75)
> N <- 2000
> x <- c(rexp(lambda[1]*N, 1), rnorm(lambda[2]*N, 20, 4))
> xx <- mixgroup(x, breaks=0:40)
> pp <- mixparam(mu=c(.5, 8), sigma=c(1, 10), pi=c(0.2, 0.5))
> mix <-  mix(xx, pp, dist="weibull", emsteps=10)
> 
> summary(mix)

Parameters:
      pi     mu  sigma
1 0.2492  1.016 0.8702
2 0.7508 20.079 4.2328

Standard Errors:
     pi.se   mu.se sigma.se
1 0.009683 0.04249  0.05137
2 0.009683 0.10972  0.06802


Analysis of Variance Table

          Df  Chisq Pr(>Chisq)    
Residuals 29 80.407   1.01e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


> p <- coef(mix)
> p
         pi        mu     sigma
1 0.2492477  1.016319 0.8702055
2 0.7507523 20.079093 4.2327769

So the exponential parameter at least is well-estimated. I believe that Weibull variates with  shape >> 1 are approximately normal, but I know that your mathematical sophistication exceeds mine by quite a bit, so consider this only a dilettante's comment.

-- 
David.


> 
> Thanks, Thomas
> 
> 
> ## Approximate an Exponential+Gaussian mixture with
> ##   a mixture of Gammas
> 
> library("mixdist")
> set.seed(123)
> lambda <- c(0.25, 0.75)
> N <- 2000
> x <- c(rexp(lambda[1]*N, 1), rnorm(lambda[2]*N, 20, 4))
> 
> xx <- mixgroup(x, breaks=0:40)
> pp <- mixparam(mu=c(1, 8), sigma=c(1, 3), pi=c(0.2, 0.5))
> mix <-  mix(xx, pp, dist="gamma", emsteps=10)
> 
> summary(mix)
> p <- coef(mix)
> beta <- with(p, sigma^2/mu)
> alpha <- with(p, mu /beta)
> lambda <- p$pi
> 
> plot(mix, xlim=c(0, 35))
> x1 <- seq(0, 35, 0.1)
> lines(x1, lambda[1]*dgamma(x1, alpha[1], 1/beta[1]),
>  col="orange", lwd=2)
> lines(x1, lambda[2]*dgamma(x1, alpha[2], 1/beta[2]),
>  col="magenta", lwd=2)
> 
> 
> 
> Am 24.01.2017 um 00:34 schrieb Bert Gunter:
>> Fitting multicomponent mixtures distributions -- and 3 is already a
>> lot of components -- is inherently ill-conditioned. You may need to
>> reassess your strategy. You might wish to post on stackexchange
>> instead to discuss such statistical issues.
>> 
>> Cheers,
>> Bert
>> Bert Gunter
>> 
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>> 
>> 
>> On Mon, Jan 23, 2017 at 2:32 PM, Thomas Petzoldt <thpe at simecol.de> wrote:
>>> Dear friends,
>>> 
>>> I am trying to separate bi- (and sometimes tri-) modal univariate mixtures
>>> of biological data, where the first component is left bounded (e.g.
>>> exponential or gamma) and the other(s) approximately Gaussian.
>>> 
>>> After checking several packages, I'm not really clear what to do. Here is an
>>> example with "mixtools" that already works quite good, however, the left
>>> component is not Gaussian (and not symmetric).
>>> 
>>> Any idea about a more adequate function or package for this problem?
>>> 
>>> Thanks a lot!
>>> 
>>> Thomas
>>> 
>>> 
>>> 
>>> library(mixtools)
>>> set.seed(123)
>>> 
>>> lambda <- c(0.25, 0.75)
>>> N      <- 200
>>> 
>>> ## dist1 ~ gamma (or exponential as a special case)
>>> #dist1 <- rexp(lambda[1]*N, 1)
>>> dist1 <- rgamma(lambda[1]*N, 1, 1)
>>> 
>>> ## dist2 ~ normal
>>> dist2 <- rnorm(lambda[2]*N, 12, 2)
>>> 
>>> ## mixture
>>> x <- c(dist1, dist2)
>>> 
>>> mix <-  spEMsymloc(x, mu0=2, eps=1e-3, verbose=TRUE)
>>> plot(mix, xlim=c(0, 25))
>>> summary(mix)
>>> 
>>> 
>>> --
>>> Thomas Petzoldt
>>> TU Dresden, Institute of Hydrobiology
>>> http://www.tu-dresden.de/Members/thomas.petzoldt
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA



More information about the R-help mailing list