[R] Very slow optim()
J C Nash
profjcnash at gmail.com
Sat Mar 13 15:05:45 CET 2021
As per my post on this, it is important to distinguish between
"CG" as a general approach and optim::CG. The latter -- my algorithm 22
from Compact Numerical Methods for Computers in 1979 -- never worked
terribly well. But Rcgmin and Rtnmin from optimx often (but not always)
perform quite well.
There are some developments that I've poked at a few times. If the
optimization of very large numbers of parameters is of interest as a
general problem (rather than a specific problem type), then this would
be worth pursuing for R.
However, we need some representative, and easy to set up, test problems.
Otherwise we get this very hand-waving list discussion.
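To make the "representative, easy to set up" point concrete, one conventional scalable test problem is the extended Rosenbrock function. This is a sketch, not anything proposed in the thread; the names `fn`, `gr`, and the choice of n = 100 and starting point are just illustrative.

```r
## Extended Rosenbrock: n must be even; pairs (x[2k-1], x[2k]).
fn <- function(x) {
  n <- length(x)
  odd <- seq(1, n - 1, 2); even <- seq(2, n, 2)
  sum(100 * (x[even] - x[odd]^2)^2 + (1 - x[odd])^2)
}
## Analytic gradient, so optim() need not use finite differences.
gr <- function(x) {
  n <- length(x)
  odd <- seq(1, n - 1, 2); even <- seq(2, n, 2)
  g <- numeric(n)
  t1 <- x[even] - x[odd]^2
  g[odd]  <- -400 * x[odd] * t1 - 2 * (1 - x[odd])
  g[even] <- 200 * t1
  g
}
n <- 100
res <- optim(rep(-1.2, n), fn, gr, method = "CG",
             control = list(maxit = 10000))
```

The same `fn`/`gr` pair can be handed unchanged to optimx::Rcgmin or optimx::Rtnmin for comparison, and n can be scaled up to probe the "very large number of parameters" regime.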
On 2021-03-13 6:56 a.m., Deepayan Sarkar wrote:
> On Sat, Mar 13, 2021 at 4:33 PM Spencer Graves
> <spencer.graves using effectivedefense.org> wrote:
>> Hi, Deepayan:
>> On 2021-03-13 01:27, Deepayan Sarkar wrote:
>>> On Sat, Mar 13, 2021 at 10:08 AM Spencer Graves
>>> <spencer.graves using effectivedefense.org> wrote:
>>>> TWO COMMENTS:
>>>> 1. DID YOU ASSIGN THE OUTPUT OF "optim" to an object, like "est <-
>>>> optim(...)"? If yes and if "optim" terminated normally, the 60,000+
>>>> parameters should be there as est$par. See the documentation on "optim".
>>>> 2. WHAT PROBLEM ARE YOU TRYING TO SOLVE?
>>>> I hope you will forgive me for being blunt (or perhaps bigoted), but
>>>> I'm skeptical about anyone wanting to use optim to estimate 60,000+
>>>> parameters. With a situation like that, I think you would be wise to
>>>> recast the problem as one in which those 60,000+ parameters are sampled
>>>> from some hyperdistribution characterized by a small number of
>>>> hyperparameters. Then write a model where your observations are sampled
>>>> from distribution(s) controlled by these random parameters. Then
>>>> multiply the likelihood of the observations by the likelihood of the
>>>> hyperdistribution and integrate out the 60,000+ parameters, leaving only
>>>> a small number of hyperparameters.
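A minimal sketch of the marginalization being suggested, for the one case where the integral is analytic: a balanced one-way random-effects model, where the per-group means are integrated out and only three hyperparameters (mu, sigma^2, tau^2) are optimized. All names and the simulated data are hypothetical, not from the thread.

```r
## Simulate: theta_i ~ N(2, 1) are the "many parameters"; we never
## estimate them directly.
set.seed(1)
G <- 50; m <- 5                       # 50 groups, 5 observations each
theta <- rnorm(G, mean = 2, sd = 1)
y <- matrix(rnorm(G * m, mean = rep(theta, each = m), sd = 0.5),
            nrow = G, byrow = TRUE)

## Marginal negative log-likelihood (constants dropped): integrating
## out theta_i gives, per group, variance components s2 (within) and
## s2 + m * t2 (for the group mean).  Log scale keeps variances > 0.
negll <- function(p) {
  mu <- p[1]; s2 <- exp(p[2]); t2 <- exp(p[3])
  ybar <- rowMeans(y)
  ss   <- rowSums((y - ybar)^2)       # within-group sums of squares
  0.5 * sum((m - 1) * log(s2) + log(s2 + m * t2) +
            ss / s2 + m * (ybar - mu)^2 / (s2 + m * t2))
}
fit <- optim(c(0, 0, 0), negll, method = "BFGS")
```

Three parameters instead of 50 (or 60,000): in the balanced case the maximizer for mu is the grand mean, which gives a quick sanity check on the fit.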
>>> Just a comment on this comment: I think it's perfectly reasonable to
>>> optimize 60k+ parameters with conjugate gradient. CG was originally
>>> developed to solve linear equations of the form Ax=b. If x was not
>>> large in size, one would just use solve(A, b) instead of an iterative
>>> method. Use of CG is quite common in image processing. A relatively small
>>> 300x300 image will give you 90k parameters.
>> Thanks for this.
>> If both A and b are 300x300, then x will also be 300x300.
> Sorry for being unclear: the images themselves (b or x) are viewed as
> a vector, so would be 90000x1. A would be 90000x90000, so essentially
> impossible to construct. CG can solve Ax=b as long as Ax can be
> evaluated (for arbitrary x).
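The "A never constructed" point can be sketched in a few lines of R: linear CG needs only a function that applies A to a vector. Here the operator is a 2-D five-point Laplacian plus a small shift to keep it positive definite; `cg_solve`, `Ax`, and the n = 100 grid (smaller than the thread's 300x300, for speed) are all illustrative choices, not code from the thread.

```r
## Matrix-free linear conjugate gradient for symmetric p.d. operators.
## `Ax` is any function returning A %*% x; A itself is never formed.
cg_solve <- function(Ax, b, tol = 1e-8, maxit = length(b)) {
  x <- numeric(length(b))
  r <- b                              # residual b - A x, with x = 0
  p <- r
  rs <- sum(r * r)
  for (i in seq_len(maxit)) {
    Ap    <- Ax(p)
    alpha <- rs / sum(p * Ap)
    x <- x + alpha * p
    r <- r - alpha * Ap
    rs_new <- sum(r * r)
    if (sqrt(rs_new) < tol) break
    p  <- r + (rs_new / rs) * p       # Fletcher-Reeves style update
    rs <- rs_new
  }
  x
}

## Example operator: discrete Laplacian on an n x n "image", applied
## without building the n^2 x n^2 matrix (90000 x 90000 when n = 300).
n <- 100
Ax <- function(x) {
  X <- matrix(x, n, n)
  Y <- 4 * X
  Y[-1, ] <- Y[-1, ] - X[-n, ]        # neighbor above
  Y[-n, ] <- Y[-n, ] - X[-1, ]        # neighbor below
  Y[, -1] <- Y[, -1] - X[, -n]        # neighbor left
  Y[, -n] <- Y[, -n] - X[, -1]        # neighbor right
  as.vector(Y + 0.01 * X)             # small shift keeps it p.d.
}
xtrue <- rnorm(n * n)
b <- Ax(xtrue)
xhat <- cg_solve(Ax, b, tol = 1e-6, maxit = 2000)
```

The `Ax` closure costs O(n^2) per call, so each CG iteration touches the 10,000-pixel image once; the equivalent dense solve would need the full n^2 x n^2 matrix.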
>> What do you do in this case if A is not square or even ill-conditioned?
> A has to be p.d.
>> Do you care if you get only one of many possible or approximate
>> solutions, and the algorithm spends most of its time making adjustments
>> in a singular subspace that would have best been avoided?
> Well, in my experience, optim(method="CG") behaves quite badly if A is
> not full-rank. I don't think other implementations will be any better,
> but I am not sure.
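The link between CG-as-solver and optim(method="CG") can be sketched as follows: solving A x = b is equivalent to minimizing f(x) = 0.5 x'Ax - b'x, whose gradient is A x - b. This small example deliberately uses a full-rank, well-conditioned A (the rank-deficient case discussed above is exactly where this breaks down); the names and sizes are illustrative only.

```r
## Solve A x = b by quadratic minimization with optim's CG.
set.seed(42)
p <- 20
M <- matrix(rnorm(p * p), p)
A <- crossprod(M) + diag(p)          # symmetric positive definite
b <- rnorm(p)

fn <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
gr <- function(x) as.vector(A %*% x - b)   # zero exactly when A x = b

res <- optim(numeric(p), fn, gr, method = "CG",
             control = list(maxit = 1000, reltol = 1e-12))
```

For a problem this small one would of course just call solve(A, b); the point is that `fn` and `gr` only need A applied to a vector, so the same pattern scales to operators too large to store.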
> If you are interested in practical uses of this, we can discuss more off-list.