[R] optim or nlminb for minimization, which to believe?

Hans W Borchers hwborchers at googlemail.com
Mon Nov 30 00:16:56 CET 2009


Your function named 'gradient' is not the correct gradient. Take as an
example the following point x0, very near to the true minimum,

    x0 <- c(-0.2517964, 0.4898680, -0.2517962, 0.4898681, 0.7500995)
 
then you get

    > gradient(x0)
    [1] -0.0372110470  0.0001816991 -0.0372102284  0.0001820976  0.0144292657

but the numerical gradient is different:

    > library(numDeriv)
    > grad(fn, x0)
    [1] -6.151645e-07 -5.507219e-07  1.969143e-07 -1.563892e-07 -4.955502e-08

that is, the derivative is close to 0 in every direction -- as it should be
at an optimum.
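
Before trusting an analytic gradient, it is worth comparing it with the
numerical one at a few random points -- a minimal sketch, assuming 'fn'
and 'gradient' have been pulled out of mml2 so they are in scope:

    library(numDeriv)
    set.seed(1)
    for (k in 1:5) {
        x <- runif(5, -1, 1)                        # random test point
        print(max(abs(gradient(x) - grad(fn, x))))  # should be near 0
    }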

No wonder optim et al. get confused when given this 'gradient'.
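
For reference, here is a sketch of a gradient that is consistent with
'fn' -- my own derivation from the quadrature approximation, so please
verify it. Differentiating -sum(log(colSums(rr1))) with respect to b[j]
gives the sum over persons and nodes of post[q,i] * (x[i,j] - P[q,j]),
where post[q,i] is the posterior weight of node q for person i. Using
the same 'qq' and 'data' (as a matrix) as in mml2:

    gradient2 <- function(b){
        P    <- plogis(outer(qq$nodes, b, "-"))       # Q x C: P[q,j]
        ## A[q,i]: likelihood of person i's responses at node q
        A    <- exp(log(P) %*% t(data) + log(1 - P) %*% t(1 - data))
        W    <- A * qq$weights                        # prior-weighted
        post <- sweep(W, 2, colSums(W), "/")          # posterior weights
        colSums(post %*% data - P * rowSums(post))    # length ncol(data)
    }

At a true optimum such as x0 above, gradient2 should be close to 0,
matching the numerical gradient.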

Regards
Hans Werner


Doran, Harold wrote:
> 
> I have constructed the function mml2 (below) based on the likelihood
> function described in the minimal LaTeX I have pasted below for anyone
> who wants to look at it. This function finds parameter estimates for a
> basic Rasch (IRT) model. Without the gradient, either nlminb or optim
> returns the correct parameter estimates and, in the case of optim, the
> correct standard errors.
> 
> By correct, I mean they match another software program as well as the
> rasch() function in the ltm package.
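> 
> For anyone reproducing that comparison, this is a sketch of the ltm
> call I mean (the constraint fixes the discrimination at 1 so the
> parameterization matches the Rasch model below; please check against
> your version of ltm):
> 
>     library(ltm)
>     fit <- rasch(tmp, constraint = cbind(ncol(tmp) + 1, 1))
>     coef(fit)[, "Dffclt"]   # item difficulties, comparable to $par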
> 
> Now, when I pass the gradient to optim, I get a message of successful
> convergence, but the parameter estimates, while *very* close, are not
> correct. When I use nlminb with the gradient, I get a message of false
> convergence, and again the estimates are off, though very close to
> being correct.
> 
> This is best illustrated via the examples:
> 
> ### Sample data set
> set.seed(1234)
> tmp <- data.frame(item1 = sample(c(0,1), 20, replace = TRUE),
>                   item2 = sample(c(0,1), 20, replace = TRUE),
>                   item3 = sample(c(0,1), 20, replace = TRUE),
>                   item4 = sample(c(0,1), 20, replace = TRUE),
>                   item5 = sample(c(0,1), 20, replace = TRUE))
> 
> ## Use function mml2 (below) with optim, with the gradient supplied
>> mml2(tmp,Q=10)
> $par
> [1] -0.2438733  0.4889333 -0.2438733  0.4889333  0.7464162
> 
> $value
> [1] 63.86376
> 
> $counts
> function gradient
>       45        6
> 
> $convergence
> [1] 0
> 
> $message
> NULL
> 
> $hessian
>          [,1]     [,2]     [,3]     [,4]     [,5]
> [1,] 4.095479 0.000000 0.000000 0.000000 0.000000
> [2,] 0.000000 3.986293 0.000000 0.000000 0.000000
> [3,] 0.000000 0.000000 4.095479 0.000000 0.000000
> [4,] 0.000000 0.000000 0.000000 3.986293 0.000000
> [5,] 0.000000 0.000000 0.000000 0.000000 3.800898
> 
> ## Use the same function, but with nlminb and the gradient supplied
>> mml2(tmp,Q=10)
> $par
> [1] -0.2456398  0.4948889 -0.2456398  0.4948889  0.7516308
> 
> $objective
> [1] 63.86364
> 
> $convergence
> [1] 1
> 
> $message
> [1] "false convergence (8)"
> 
> $iterations
> [1] 4
> 
> $evaluations
> function gradient
>       41        4
> 
> 
> ### Use nlminb with the gradient turned off
>> mml2(tmp,Q=10)
> $par
> [1] -0.2517961  0.4898682 -0.2517961  0.4898682  0.7500994
> 
> $objective
> [1] 63.8635
> 
> $convergence
> [1] 0
> 
> $message
> [1] "relative convergence (4)"
> 
> $iterations
> [1] 8
> 
> $evaluations
> function gradient
>       11       64
> 
> ### Use optim with the gradient turned off
> 
>> mml2(tmp,Q=10)
> $par
> [1] -0.2517990  0.4898676 -0.2517990  0.4898676  0.7500906
> 
> $value
> [1] 63.8635
> 
> $counts
> function gradient
>       22        7
> 
> $convergence
> [1] 0
> 
> $message
> NULL
> 
> $hessian
>            [,1]       [,2]       [,3]       [,4]       [,5]
> [1,]  3.6311153 -0.3992959 -0.4224747 -0.3992959 -0.3764526
> [2,] -0.3992959  3.5338195 -0.3992959 -0.3960956 -0.3798141
> [3,] -0.4224747 -0.3992959  3.6311153 -0.3992959 -0.3764526
> [4,] -0.3992959 -0.3960956 -0.3992959  3.5338195 -0.3798141
> [5,] -0.3764526 -0.3798141 -0.3764526 -0.3798141  3.3784816
> 
> The parameter estimates with and without the gradient are so close that
> I am inclined to believe the gradient is correct and that maybe the
> problem is elsewhere.
> 
> It seems odd that optim reports convergence while nlminb does not when
> the gradient is supplied. But in either case, using the gradient gives
> parameter estimates that differ from what I believe are the correct
> values. So, at this point, I am unclear whether the problem lies in the
> way the functions are called, in how I am passing the gradient, or in
> the way I have constructed the gradient itself.
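> 
> To convince myself that this symptom can come from a wrong gradient,
> here is a toy check separate from the model above (a sketch; the exact
> messages may vary):
> 
>     f     <- function(x) sum((x - 1)^2)           # minimum at (1, 1)
>     g_bad <- function(x) 2 * (x - 1) + 0.05       # deliberately biased
>     nlminb(c(0, 0), f, gradient = g_bad)$message  # may flag false convergence
>     optim(c(0, 0), f, gr = g_bad,
>           method = "BFGS")$convergence            # may still report 0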
> 
> Below is the function, and also some LaTeX for those interested in
> looking at the likelihood function.
> 
> Thanks for any reactions
> Harold
> 
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> i386-pc-mingw32
> 
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] ltm_0.9-2      polycor_0.7-7  sfsmisc_1.0-9  mvtnorm_0.9-8  msm_0.9.4      MASS_7.3-3     MiscPsycho_1.5
> [8] statmod_1.4.1
> 
> loaded via a namespace (and not attached):
> [1] splines_2.10.0  survival_2.35-7 tools_2.10.0
> 
> 
> mml2 <- function(data, Q, startVal = NULL, gr = TRUE, ...){
>     ## NB: the gr argument is currently unused; the optimizer call was
>     ## switched by hand between the runs shown above
>     if(!is.null(startVal) && length(startVal) != ncol(data)){
>         stop("Length of argument startVal not equal to the number of parameters estimated")
>     }
>     if(is.null(startVal)){
>         ## default starting values from the item means
>         p <- colMeans(data)
>         startVal <- as.vector(log((1 - p)/p))
>     }
>     qq <- gauss.quad.prob(Q, dist = 'normal')   # from package statmod
>     rr1 <- matrix(0, nrow = Q, ncol = nrow(data))
>     data <- as.matrix(data)
>     L <- nrow(data)
>     C <- ncol(data)
>     ## negative log marginal likelihood, approximated by quadrature
>     fn <- function(b){
>         for(j in 1:Q){
>             for(i in 1:nrow(data)){
>                 rr1[j,i] <- exp(sum(dbinom(data[i,], 1,
>                     plogis(qq$nodes[j] - b), log = TRUE))) * qq$weights[j]
>             }
>         }
>         -sum(log(colSums(rr1)))
>     }
>     ## analytic gradient (the one in question)
>     gradient <- function(b){
>         p <- outer(qq$nodes, b, plogis) * L
>         x <- t(matrix(colSums(data), nrow = C, Q))
>         w <- matrix(qq$weights, Q, C)
>         rr2 <- (-x + p) * w
>         -colSums(rr2)
>     }
>     opt <- optim(startVal, fn, gr = gradient, hessian = TRUE,
>                  method = "BFGS")
>     ## opt <- nlminb(startVal, fn, gradient = gradient)
>     ## list("coefficients" = opt$par, "LogLik" = -opt$value,
>     ##      "Std.Error" = 1/sqrt(diag(opt$hessian)))
>     opt
> }
> 
> 
> 
> ### LaTeX describing the likelihood function
> \documentclass{article}
> \usepackage{latexsym}
> \usepackage{amssymb,amsmath,bm}
> \newcommand{\Prb}{\operatorname{Prob}}
> \title{Marginal Maximum Likelihood for IRT Models}
> \author{Harold C. Doran}
> \begin{document}
> \maketitle
> The likelihood function is written as:
> \begin{equation}
> \label{eqn:mml}
> f(x) =
> \prod^N_{i=1}\int_{\Theta}\prod^K_{j=1}p(x|\theta,\beta)f(\theta)d\theta
> \end{equation}
> \noindent where $N$ indexes the total number of individuals, $K$ indexes
> the total number of items, $p(x|\theta,\beta)$ is the data likelihood and
> $f(\theta)$ is a population distribution. For the Rasch model, the data
> likelihood is:
> \begin{equation}
> p(x|\theta,\beta) = \prod^N_{i=1}\prod^K_{j=1} \Pr(x_{ij} = 1 | \theta_i,
> \beta_j)^{x_{ij}} \left[1 - \Pr(x_{ij} = 1 | \theta_i,
> \beta_j)\right]^{(1-x_{ij})}
> \begin{equation}
> \label{rasch}
> \Pr(x_{ij} = 1 | \theta_i, \beta_j) = \frac{1}{1 +
> e^{-(\theta_i-\beta_j)}} \quad i = 1, \ldots, N; \; j = 1, \ldots, K
> \end{equation}
> \noindent where $\theta_i$ is the ability estimate of person $i$ and
> $\beta_j$ is the location parameter for item $j$. The integral in
> Equation~(\ref{eqn:mml}) has no closed form expression, so it is
> approximated using Gauss-Hermite quadrature as:
> \begin{equation}
> \label{eqn:mml:approx}
> f(x) \approx
> \prod^N_{i=1}\sum_{q=1}^{Q}\prod^K_{j=1}p(x|\theta_q,\beta)w_q
> \end{equation}
> \noindent where $\theta_q$ is the node and $w_q$ the weight at
> quadrature point $q$. With Equation~(\ref{eqn:mml:approx}), the
> remaining challenge is to find
> $\underset{\beta}{\operatorname{arg\,max}} \, f(x)$.
> \end{document}
> 
