[R] maximum likelihood convergence reproducing Anderson Blundell 1982 Econometrica R vs Stata

Sat May 21 23:26:29 CEST 2011

Hi,

I don't think the final verdict has been spoken.  Peter's posts have hinted at ill-conditioning as the crux of the problem.  So, I decided to try a couple of  more things: (1) standardizing the covariates, (2) exact gradient, and (3) both (1) and (2).

I compute the "exact" gradient using a complex-step derivative approach.  This works just like the standard first-order, forward differencing.  The only (but, essential) difference is that an imaginary increment, i*dx, is used.  This, incredibly, gives exact gradients (up to machine precision). 

Here are the code and the results of my experiments:

data <- read.table("h:/computations/optimx_example.dat", header=TRUE, sep=",")
attach(data)
require(optimx)

## model 18
lnl <- function(theta,y1, y2, x1, x2, x3) {
  n <- length(y1)
  beta <- theta[1:8]
  e1 <- y1 - theta[1] - theta[2]*x1 - theta[3]*x2 - theta[4]*x3
  e2 <- y2 - theta[5] - theta[6]*x1 - theta[7]*x2 - theta[8]*x3
  e <- cbind(e1, e2)
  sigma <- t(e)%*%e
  det.sigma <- sigma[1,1] * sigma[2,2] - sigma[1,2] * sigma[2,1]
  logl <- -1*n/2*(2*(1+log(2*pi)) + log(det.sigma)) 
  return(-logl)
}

csd <- function(fn, x, ...) {
# Complex step derivatives; yields exact derivatives
h <- .Machine$double.eps 
n <- length(x)
h0 <- g <- rep(0, n)
for (i in 1:n) {
h0[i] <- h * 1i
g[i] <- Im(fn(x+h0, ...))/h 
h0[i]  <- 0
}
g
}

gr.csd <- function(theta,y1, y2, x1, x2, x3) {
csd(lnl, theta, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3)
}

# exact solution as the starting value
start <- c(coef(lm(y1~x1+x2+x3)), coef(lm(y2~x1+x2+x3)))
p1 <- optimx(start, lnl, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

# numerical gradient
p2 <- optimx(rep(0,8), lnl, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

# exact gradient
p2g <- optimx(rep(0,8), lnl, gr.csd, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

# comparing p2 and p2g, we see the dramatic improvement in BFGS when exact gradient is used, we also see a major difference for L-BFGS-B
# Exact gradient did not affect the gradient methods, CG and spg, much.  However, convergence of Rcgmin improved when exact gradient was used

x1s <- scale(x1)
x2s <- scale(x2)
x3s <- scale(x3)

p3 <- optimx(rep(0,8),lnl, y1=y1, y2=y2, x1=x1s, x2=x2s, x3=x3s, control=list(all.methods=TRUE, maxit=1500))

# both scaling and exact gradient
p3g <- optimx(rep(0,8),lnl, gr.csd, y1=y1, y2=y2, x1=x1s, x2=x2s, x3=x3s, control=list(all.methods=TRUE, maxit=1500))

# Comparing p3 and p3g, use of exact gradient improved spg dramatically.  However, it made CG worse! 

# Of course, derivative-free methods newuoa, and Nelder-Mead are improved by scaling, but not by the availability of exact gradients.  I don't know what is wrong with bobyqa in this example.

In short, even with scaling and exact gradients, this optimization problem is recalcitrant.

Best,
Ravi.

________________________________________
From: Mike Marchywka [marchywka at hotmail.com]
Sent: Thursday, May 12, 2011 8:30 AM
To: Ravi Varadhan; pdalgd at gmail.com; alex.olssen at gmail.com
Cc: r-help at r-project.org
Subject: RE: [R] maximum likelihood convergence reproducing Anderson Blundell 1982 Econometrica R vs Stata

So what was the final verdict on this discussion? I kind of
lost track if anyone has a minute to summarize and critique my summary below.

Apparently there were two issues, the comparison between R and Stata
was one issue and the "optimum" solution another. As I understand it,
there was some question about R numerical gradient calculation. This would
suggest some features of the function may be of interest to consider.

The function to be optimized appears to be, as OP stated,
some function of residuals of two ( unrelated ) fits.
The residual vectors e1 and e2 are dotted in various combinations
creating a matrix whose determinant is (e1.e1)(e2.e2)-(e1.e2)^2 which
is the result to be minimized by choice of theta. Theta it seems is
an 8 component vector, 4 components determine e1 and the other 4 e2.
Presumably a unique solution would require that e1 and e2, both n-component vectors,
 point in different directions or else both could become aribtarily large
while keeping the error signal at zero. For fixed magnitudes, colinearity
would reduce the "Error."  The intent would appear to be to
keep the residuals distributed similarly in the two ( unrelated) fits.
 I guess my question is,
" did anyone determine that there is a unique solution?" or
am I totally wrong here ( I haven't used these myself to any
extent and just try to run some simple teaching examples, asking
for my own clarification as much as anything).

Thanks.

----------------------------------------
> From: rvaradhan at jhmi.edu
> To: pdalgd at gmail.com; alex.olssen at gmail.com
> Date: Sat, 7 May 2011 11:51:56 -0400
> CC: r-help at r-project.org
> Subject: Re: [R] maximum likelihood convergence reproducing Anderson Blundell 1982 Econometrica R vs Stata
>
> There is something strange in this problem. I think the log-likelihood is incorrect. See the results below from "optimx". You can get much larger log-likelihood values than for the exact solution that Peter provided.
>
> ## model 18
> lnl <- function(theta,y1, y2, x1, x2, x3) {
> n <- length(y1)
> beta <- theta[1:8]
> e1 <- y1 - theta[1] - theta[2]*x1 - theta[3]*x2 - theta[4]*x3
> e2 <- y2 - theta[5] - theta[6]*x1 - theta[7]*x2 - theta[8]*x3
> e <- cbind(e1, e2)
> sigma <- t(e)%*%e
> logl <- -1*n/2*(2*(1+log(2*pi)) + log(det(sigma))) # it looks like there is something wrong here
> return(-logl)
> }
>
> data <- read.table("e:/computing/optimx_example.dat", header=TRUE, sep=",")
>
> attach(data)
>
> require(optimx)
>
> start <- c(coef(lm(y1~x1+x2+x3)), coef(lm(y2~x1+x2+x3)))
>
> # the warnings can be safely ignored in the "optimx" calls
> p1 <- optimx(start, lnl, hessian=TRUE, y1=y1, y2=y2,
> + x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))
>
> p2 <- optimx(rep(0,8), lnl, hessian=TRUE, y1=y1, y2=y2,
> + x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))
>
> p3 <- optimx(rep(0.5,8), lnl, hessian=TRUE, y1=y1, y2=y2,
> + x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))
>
> Ravi.
> ________________________________________
> From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] On Behalf Of peter dalgaard [pdalgd at gmail.com]
> Sent: Saturday, May 07, 2011 4:46 AM
> To: Alex Olssen
> Cc: r-help at r-project.org
> Subject: Re: [R] maximum likelihood convergence reproducing Anderson Blundell 1982 Econometrica R vs Stata
>
> On May 6, 2011, at 14:29 , Alex Olssen wrote:
>
> > Dear R-help,
> >
> > I am trying to reproduce some results presented in a paper by Anderson
> > and Blundell in 1982 in Econometrica using R.
> > The estimation I want to reproduce concerns maximum likelihood
> > estimation of a singular equation system.
> > I can estimate the static model successfully in Stata but for the
> > dynamic models I have difficulty getting convergence.
> > My R program which uses the same likelihood function as in Stata has
> > convergence properties even for the static case.
> >
> > I have copied my R program and the data below. I realise the code
> > could be made more elegant - but it is short enough.
> >
> > Any ideas would be highly appreciated.
>
> Better starting values would help. In this case, almost too good values are available:
>
> start <- c(coef(lm(y1~x1+x2+x3)), coef(lm(y2~x1+x2+x3)))
>
> which appears to be the _exact_ solution.
>
> Apart from that, it seems that the conjugate gradient methods have difficulties with this likelihood, for some less than obvious reason. Increasing the maxit gets you closer but still not satisfactory.
>
> I would suggest trying out the experimental optimx package. Apparently, some of the algorithms in there are much better at handling this likelihood, notably "nlm" and "nlminb".
>
>
>
>
> >
> > ## model 18
> > lnl <- function(theta,y1, y2, x1, x2, x3) {
> > n <- length(y1)
> > beta <- theta[1:8]
> > e1 <- y1 - theta[1] - theta[2]*x1 - theta[3]*x2 - theta[4]*x3
> > e2 <- y2 - theta[5] - theta[6]*x1 - theta[7]*x2 - theta[8]*x3
> > e <- cbind(e1, e2)
> > sigma <- t(e)%*%e
> > logl <- -1*n/2*(2*(1+log(2*pi)) + log(det(sigma)))
> > return(-logl)
> > }
> > p <- optim(0*c(1:8), lnl, method="BFGS", hessian=TRUE, y1=y1, y2=y2,
> > x1=x1, x2=x2, x3=x3)
> >
> > "year","y1","y2","x1","x2","x3"
> > 1929,0.554779,0.266051,9.87415,8.60371,3.75673
> > 1930,0.516336,0.297473,9.68621,8.50492,3.80692
> > 1931,0.508201,0.324199,9.4701,8.27596,3.80437
> > 1932,0.500482,0.33958,9.24692,7.99221,3.76251
> > 1933,0.501695,0.276974,9.35356,7.98968,3.69071
> > 1934,0.591426,0.287008,9.42084,8.0362,3.63564
> > 1935,0.565047,0.244096,9.53972,8.15803,3.59285
> > 1936,0.605954,0.239187,9.6914,8.32009,3.56678
> > 1937,0.620161,0.218232,9.76817,8.42001,3.57381
> > 1938,0.592091,0.243161,9.51295,8.19771,3.6024
> > 1939,0.613115,0.217042,9.68047,8.30987,3.58147
> > 1940,0.632455,0.215269,9.78417,8.49624,3.57744
> > 1941,0.663139,0.184409,10.0606,8.69868,3.6095
> > 1942,0.698179,0.164348,10.2892,8.84523,3.66664
> > 1943,0.70459,0.146865,10.4731,8.93024,3.65388
> > 1944,0.694067,0.161722,10.4465,8.96044,3.62434
> > 1945,0.674668,0.197231,10.279,8.82522,3.61489
> > 1946,0.635916,0.204232,10.1536,8.77547,3.67562
> > 1947,0.642855,0.187224,10.2053,8.77481,3.82632
> > 1948,0.641063,0.186566,10.2227,8.83821,3.96038
> > 1949,0.646317,0.203646,10.1127,8.82364,4.0447
> > 1950,0.645476,0.187497,10.2067,8.84161,4.08128
> > 1951,0.63803,0.197361,10.2773,8.9401,4.10951
> > 1952,0.634626,0.209992,10.283,9.01603,4.1693
> > 1953,0.631144,0.219287,10.3217,9.06317,4.21727
> > 1954,0.593088,0.235335,10.2101,9.05664,4.2567
> > 1955,0.60736,0.227035,10.272,9.07566,4.29193
> > 1956,0.607204,0.246631,10.2743,9.12407,4.32252
> > 1957,0.586994,0.256784,10.2396,9.1588,4.37792
> > 1958,0.548281,0.271022,10.1248,9.14025,4.42641
> > 1959,0.553401,0.261815,10.2012,9.1598,4.4346
> > 1960,0.552105,0.275137,10.1846,9.19297,4.43173
> > 1961,0.544133,0.280783,10.1479,9.19533,4.44407
> > 1962,0.55382,0.281286,10.197,9.21544,4.45074
> > 1963,0.549951,0.28303,10.2036,9.22841,4.46403
> > 1964,0.547204,0.291287,10.2271,9.23954,4.48447
> > 1965,0.55511,0.281313,10.2882,9.26531,4.52057
> > 1966,0.558182,0.280151,10.353,9.31675,4.58156
> > 1967,0.545735,0.294385,10.3351,9.35382,4.65983
> > 1968,0.538964,0.294593,10.3525,9.38361,4.71804
> > 1969,0.542764,0.299927,10.3676,9.40725,4.76329
> > 1970,0.534595,0.315319,10.2968,9.39139,4.81136
> > 1971,0.545591,0.315828,10.2592,9.34121,4.84082
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.