[R] problem with convergence in mle2/optim function

Sun Dec 2 23:14:31 CET 2012

Dear Berend,

Thank you so much for your help!  I was able to write the gradient 
function for my NLL function.  For you're and others' possible interest, 
here is my final gradient function:

Following from my description of the problem below, gr.p1, gr.p2, 
gr.mu1, and gr.mu2 are the (very large) gradient equations of NLL2, 
below, with respect to the parameters p1, p2, mu1, and mu2, 
respectively.  The gradient function is:

# Gradient function of NLL1
grr <- function(par, y){
  p1 <- par[1]
  p2 <- par[2]
  mu1 <- par[3]
  mu2 <- par[4]
  t <- y[,1]
  n1 <- y[,2]
  n2 <- y[,3]
  n3 <- y[,4]
  gr.p1 <- ....
  gr.p2 <- ....
  gr.mu1 <- ....
  gr.mu2 <- ....
  gr.mat <- matrix(c(gr.p1, gr.p2, gr.mu1, gr.mu2), ncol = 4)
  -colSums(gr.mat)
}

I verified this gradient function with numerical approximation with the 
grad() function [numDeriv package].

Thanks again.
Adam Zeilinger

On 10/10/2012 3:57 AM, Berend Hasselman wrote:
> On 10-10-2012, at 00:21, Adam Zeilinger wrote:
>
>> Dear R help,
>>
>> Thanks again for the responses.  I increased the lower constraint to:
>>
>> lower = list(p1 = 0.0001, p2 = 0.0001, mu1 = 0.0001, mu2 = 0.0001).
>>
>> I also included an upper box constraint of:
>>
>> upper = list(p1 = Inf, p2 = Inf, mu1 = p1t, mu2 = p2t).
>>
>> Making these changes improved the rate of convergence among stochastic simulation runs, but I still had convergence problems.
>>
>> I found success in switching from mle2/optim to spg (BB package).  So far, spg has produced similarly precise estimates as L-BFGS-B and consistently provides parameter estimates.
>>
>> If anyone is interested, here is the new objective function and spg call, instead of my previous objective function and mle2 call.  All other parts of my reproducible code are the same as I've previously supplied:
>>
>> ######################################################################
>> library(BB)
>>
>> # Objective function for spg()
>> NLL2 <- function(par, y){
>>   p1 <- par[1]
>>   p2 <- par[2]
>>   mu1 <- par[3]
>>   mu2 <- par[4]
>>   t <- y$tv
>>   n1 <- y$n1
>>   n2 <- y$n2
>>   n3 <- y$n3
>>   P1 <- (p1*((-1 + exp(sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2)))*t))*((-mu2)*(mu2 - p1 + p2) +
>>     mu1*(mu2 + 2*p2)) - mu2*sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2))) -
>>     exp(sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))*t)*
>>     mu2*sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2))) +
>>     2*exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2))))*t)*mu2*
>>     sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))))/
>>     exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2))))*t)/(2*(mu2*p1 + mu1*(mu2 + p2))*
>>     sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2))))
>>   P2 <- (p2*((-1 + exp(sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2)))*t))*(-mu1^2 + 2*mu2*p1 +
>>     mu1*(mu2 - p1 + p2)) - mu1*sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2))) -
>>     exp(sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))*t)*
>>     mu1*sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2))) +
>>     2*exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2))))*t)*mu1*
>>     sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))))/
>>     exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>     4*(mu2*p1 + mu1*(mu2 + p2))))*t)/(2*(mu2*p1 + mu1*(mu2 + p2))*
>>     sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2))))
>>   P3 <- 1 - P1 - P2
>>   p.all <- c(P1, P2, P3)
>>   #cat("NLL.free p.all {P1,P2,P3}\n")
>>   #print(matrix(p.all, ncol=3))
>>   -sum(dmnom2(c(n1, n2, n3), prob = p.all, log = TRUE))
>> }
>>
>> par <- c(p1t, p2t, mu1t, mu2t)
>>
>> spg.fit <- spg(par = par, fn = NLL2, y = yt,
>>             lower = c(0.001, 0.001, 0.001, 0.001),
>>             control = list(maxit = 5000))
>>
>> ########################################################################
>>
>> My next problem is that spg takes about twice as long as L-BFGS-B to converge.  The spg help file strongly suggests the use of an exact gradient function to improve speed.  But I am having trouble writing a gradient function.  Here is what I have so far:
>>
>> I derived the gradient function by taking the derivative of my NLL equation with respect to each parameter.  My NLL equation is the probability mass function of the trinomial distribution.  Here is some reproducible code:
>>
>> #########################################################################
>> library(Ryacas)
>>
>> p1 <- Sym("p1"); p2 <- Sym("p2"); mu1 <- Sym("mu1"); mu2 <- Sym("mu2")
>> t <- Sym("t"); n1 <- Sym("n1"); n2 <- Sym("n2"); n3 <- Sym("n3")
>>
>> P1.symb <- ((p1*((-1 + exp(sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2)))*t))*((-mu2)*(mu2 - p1 + p2) +
>>   mu1*(mu2 + 2*p2)) - mu2*sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2))) -
>>   exp(sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))*t)*
>>   mu2*sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2))) +
>>   2*exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2))))*t)*mu2*
>>   sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))))/
>>   exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2))))*t)/(2*(mu2*p1 + mu1*(mu2 + p2))*
>>   sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))))
>>
>> P2.symb <- ((p2*((-1 + exp(sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2)))*t))*(-mu1^2 + 2*mu2*p1 +
>>   mu1*(mu2 - p1 + p2)) - mu1*sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2))) -
>>   exp(sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))*t)*
>>   mu1*sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2))) +
>>   2*exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2))))*t)*mu1*
>>   sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))))/
>>   exp((1/2)*(mu1 + mu2 + p1 + p2 + sqrt((mu1 + mu2 + p1 + p2)^2 -
>>   4*(mu2*p1 + mu1*(mu2 + p2))))*t)/(2*(mu2*p1 + mu1*(mu2 + p2))*
>>   sqrt((mu1 + mu2 + p1 + p2)^2 - 4*(mu2*p1 + mu1*(mu2 + p2)))))
>>
>> P3.symb <- 1 - P1.symb - P2.symb
>>
>> # gradient equation with respect to parameter p1 for probability mass
>> # function of the trinomial distribution with probabilities P1, P2, P3
>>
>> gr.p1 <- deriv(log(P1.symb^n1), p1) + deriv(log(P2.symb^n2), p1) + deriv(log(P3.symb^n3), p1)
>>
>> ######################################################################
>>
>> gr.p1 is very large equation, which I won't reproduce here. Let's say that the four gradient equations for the four parameters are defined as gr.p1, gr.p2, gr.mu1, gr.mu2, and all are derived as described above for gr.p1.  These gradient equations are functions of p1, p2, mu1, mu2, t, n1, n2, and n3.  My current gradient function is:
>>
>> grr <- function(par, y){
>>   p1 <- par[1]
>>   p2 <- par[2]
>>   mu1 <- par[3]
>>   mu2 <- par[4]
>>   t <- y[,1]
>>   n1 <- y[,2]
>>   n2 <- y[,3]
>>   n3 <- y[,4]
>>   gr.p1 <- ....
>>   gr.p2 <- ....
>>   gr.mu1 <- ....
>>   gr.mu2 <- ....
>>   c(gr.p1, gr.p2, gr.mu1, gr.mu2)
>> }
>>
>> The problem is that I need to supply values for t, n1, n2, and n3 to the gradient function, which are from the dataset yt, above.  When I supply the dataset yt, the grr function produces a vector of length 4*nrow(yt) = 80.  However, spg requires a gradient function that returns a vector of length(par) = 4.  When I include this gradient function in my spg function, I get an error that the gradient function is incorrect.
>>
>> Does anyone have any suggestions on how to write my gradient function?
> Your grr function is not returning the gradient of NLL2 w.r.t. the parameters p1,p2,mu1,mu2.
> You have 20 time periods and 4 parameters. Your grr is returning a single vector containing the gradients stacked (length=4*20).
> You need to return the gradient of -sum(dmnom2(c(n1, n2, n3), prob = p.all, log = TRUE)) with respect to those parameters.
> That would be a vector of length 4.
>
>> Am I calculating the gradient equation, gr.p1 incorrectly?
> I don't know.
>
> Berend
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Adam Zeilinger
Post Doctoral Scholar
Department of Entomology
University of California Riverside
www.linkedin.com/in/adamzeilinger