[R] checkpointing

Henrik Bengtsson henr|k@bengt@@on @end|ng |rom gm@||@com
Wed Dec 15 01:39:28 CET 2021


On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson <andy using yovo.org> wrote:
>
> Those are good points, Duncan. I am experimenting with a nice checkpointing tool called DMTCP. It operates on the system level but is quite OS-dependent. It can be found at http://dmtcp.sourceforge.net/index.html.
>
> Still, it would be nice to be able to checkpoint calls within R to potentially long-running processes like optim().

Tantalizing idea. Imagine if we could come up with some de-facto
standard API for this, and that such a framework could be called
automatically by R, similar to how user interrupts are checked on a
regular basis (e.g. via R_CheckUserInterrupt()) by the R engine and
throughout the R code. That could help with troubleshooting and
debugging, e.g. sending the checkpoint to someone else or going
backwards in time.
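
To make the restricted case Duncan describes in the quoted text below
a bit more concrete (a process whose entire state is kept in R
variables), here is a minimal sketch. The helper name
run_with_checkpoints() and its arguments are made up for illustration;
this is not an existing R or package API. It only saves plain R
objects, so it will not capture open connections, external pointers,
or child processes.

```r
# Hypothetical helper (not part of R or any package): run an iterative
# computation whose entire state lives in one R list, and save that
# list to disk every `every` iterations so the run can be resumed.
run_with_checkpoints <- function(step, init, n_iter, file, every = 10L) {
  # Resume from an existing checkpoint, otherwise start from `init`
  state <- if (file.exists(file)) {
    readRDS(file)
  } else {
    list(iter = 0L, value = init)
  }
  while (state$iter < n_iter) {
    state$value <- step(state$value)
    state$iter <- state$iter + 1L
    if (state$iter %% every == 0L || state$iter == n_iter) {
      # Write to a temporary file and then rename it, to reduce the
      # risk of leaving a half-written checkpoint behind
      tmp <- paste0(file, ".tmp")
      saveRDS(state, tmp)
      file.rename(tmp, file)
    }
  }
  state$value
}

# Toy example: Newton iterations for sqrt(2)
newton_step <- function(x) (x + 2 / x) / 2
ckpt <- tempfile(fileext = ".rds")
res <- run_with_checkpoints(newton_step, init = 1, n_iter = 25L, file = ckpt)
print(res)
```

Calling it again with the same file picks up from the saved iteration
count instead of starting over. Something like optim() does not expose
its internal state this way, which is part of why a common API would
be needed.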

Pasting in the below since I failed to hit Reply *All* the other day,
and it was only Richard who got it:

A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
CheckPointing) for Linux (https://github.com/dmtcp/dmtcp).  I'm
sharing in case someone is interested in investigating this further.
Also, somewhere on the DMTCP wiki, they asked for testing with R by
more experienced users.

"DMTCP is a tool to transparently checkpoint the state of multiple
simultaneous applications, including multi-threaded and distributed
applications. It operates directly on the user binary executable,
without any Linux kernel modules or other kernel modifications."

They seem to be able to run this with HPC jobs, open files, Linux
containers, and even MPI.  I've only tested it very quickly with
interactive R and it seems to work.  Obviously more testing needs to
be done to identify when it doesn't work.  For example, I'd have a
hard time believing it would work out of the box with local parallel
PSOCK workers.  They mention "plug-ins", so maybe there's a way to add
support for specific use cases one by one.

Several academic HPC environments appear to use it, e.g.

* https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
* http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
* https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP

That's all I have time for now,

Henrik

>
> -Andy
>
> On 12/13/21 11:51 AM, Duncan Murdoch wrote:
> > On 13/12/2021 12:58 p.m., Greg Minshall wrote:
> >> Jeff,
> >>
> >>> This sounds like an OS feature, not an R feature... certainly not a
> >>> portable R feature.
> >>
> >> i'm not arguing for it, but this seems to me like something that could
> >> be a language feature.
> >>
> >
> > R functions can call libraries written in other languages, and can start processes, etc.  R doesn't know everything going on in every function call, and would have a lot of trouble saving it.
> >
> > If you added some limitations, e.g. a process that periodically has its entire state stored in R variables, then it would be a lot easier.
> >
> > Duncan Murdoch
>
> --
> Andy Jacobson
> andy using yovo.org


