[Rd] checkpointing

Tue Jan 3 17:39:27 CET 2006

On Tue, Jan 03, 2006 at 01:26:39PM +0000, Prof Brian Ripley wrote:
> On Tue, 3 Jan 2006, Kasper Daniel Hansen wrote:
> 
> >On Jan 3, 2006, at 9:36 AM, Brian D Ripley wrote:
> >
> >>I use save.image() or save(), which seem exactly what you are asking for.
> >
> >I have the (perhaps unsupported) impression that Ross wanted to save the 
> >progress during the optim run. Since it spends most of its time in the 
> >.Internal(optim(***)) call, save/save.image would not work.
> 
> It certainly does not!  
I'm having trouble following; does that sentence mean the preceding
one is wrong, or that save won't work?

> It is most likely spending time in the callbacks 
> to evaluate the function/gradient.  
Yes.
> We have used save() to save the 
> current information (e.g. current parameter values) from inside optim so a 
> restart could be done, 
Did you do this by
* using an existing feature of optim I don't know about;
* modifying the code for optim; or
* writing an objective function that saved the parameters with which
  it was called (which, now that I think of it, might be the simplest
  approach)?
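
A minimal sketch of the third option, under some assumptions: the
helper name make_checkpointed, the Rosenbrock stand-in objective, and
the temporary checkpoint file are all hypothetical and chosen only for
illustration. Because optim evaluates the objective at perturbed points
when it estimates gradients numerically, the wrapper records the best
point seen so far rather than the most recent call.

```r
# Stand-in for an expensive objective (hypothetical example function).
rosenbrock <- function(p) 100 * (p[2] - p[1]^2)^2 + (1 - p[1])^2

# Hypothetical helper: wrap an objective so that every improvement in
# the objective value is written to `file` with save().
make_checkpointed <- function(fn, file) {
  best_value <- Inf
  function(par, ...) {
    value <- fn(par, ...)
    if (is.finite(value) && value < best_value) {
      best_value <<- value
      ckpt_par <- par
      ckpt_value <- value
      save(ckpt_par, ckpt_value, file = file)
    }
    value  # optim still receives the plain objective value
  }
}

ckpt_file <- tempfile(fileext = ".RData")
fit <- optim(c(-1.2, 1), make_checkpointed(rosenbrock, ckpt_file),
             method = "BFGS")

# Read the checkpoint back into its own environment.
restored <- new.env()
load(ckpt_file, envir = restored)
```

For a cheap objective, writing on every improvement would dominate the
run time; for an objective that takes minutes per call the cost of
save() should be negligible.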

My guess was that optim keeps its state in local variables that would
not be captured by a save.image.  Are you saying the relevant
variables are saved and can be fished out if needed?

It would also probably save some time if the estimated matrix of 2nd
derivatives were saved too (I supply only the objective function, not
derivatives), but that's minor compared to having the parameter
values.
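
A restart from saved parameter values might look like the sketch
below. The interrupted run is simulated with a small maxit, and the
objective and file name are placeholders rather than anything from the
real problem. As far as I can tell, optim has no argument for supplying
an initial curvature estimate, so the resumed run rebuilds it from
scratch and only the parameter values carry over.

```r
# Stand-in for an expensive objective (hypothetical example function).
rosenbrock <- function(p) 100 * (p[2] - p[1]^2)^2 + (1 - p[1])^2

# Simulate an interrupted run by stopping after a few iterations,
# then save the parameters reached so far.
ckpt_file <- tempfile(fileext = ".RData")
partial <- optim(c(-1.2, 1), rosenbrock, method = "BFGS",
                 control = list(maxit = 5))
ckpt_par <- partial$par
save(ckpt_par, file = ckpt_file)

# Later, possibly in a fresh R session: load the checkpoint and use the
# saved parameters as the starting point of a new optim call.
restored <- new.env()
load(ckpt_file, envir = restored)
resumed <- optim(restored$ckpt_par, rosenbrock, method = "BFGS")
```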

> but then I have only once encountered someone 
> running a single optimization for over a week: there normally are ways to 
> speed things up.

I certainly hope so.  However, the problem size is likely to remain
large.

In answer to the other question about using OS checkpointing
facilities, I haven't tried them since the application will be running
on a cluster.  More precisely, the optimization will be driven from a
single machine, but the calculation of the objective function will be
distributed.  So checkpointing at the level of the optimization
function is a good fit to my needs.  There are some cluster OSes that
provide a kind of unified process space across the processors (Scyld,
MOSIX), but we're not using them and checkpointing them is an unsolved
problem.  At least, it was unsolved a couple of years ago when I
looked into it.

Ross