[R] checkpointing

Sorkin, John j@ork|n @end|ng |rom @om@um@ry|@nd@edu
Sat Dec 18 06:02:21 CET 2021


Colleagues,

I am late to this thread. (It brings me back to my days running checkpoint restart on an IBM 370, which was very useful for very, very long jobs.) A search for "linux checkpoint restore" retrieved information about CRIU (Checkpoint/Restore In Userspace), which sounds a lot like the facility I used on the IBM 370. It appears to allow a user's process to be stopped, have its state backed up, and then be restarted. Perhaps this would satisfy (at least for Linux users of R or RStudio) the request for a checkpoint/restart ability in an R program.

Please let me know if you agree.

John

________________________________________
From: R-help <r-help-bounces using r-project.org> on behalf of Andy Jacobson via R-help <r-help using r-project.org>
Sent: Tuesday, December 14, 2021 8:59 PM
To: Henrik Bengtsson
Cc: Greg Minshall; Andy Jacobson via R-help; Andy Jacobson
Subject: Re: [R] checkpointing

I have been using DMTCP successfully for a long-running optim() task. This is a single-core process running on a large Linux cluster with Slurm as the job manager. This cluster places an 8-hour limit on individual jobs, and since my cost function takes 11 minutes to compute, I need many such jobs run sequentially. To make DMTCP work, I have had to rework file I/O to avoid references to temporary files written to /tmp, but other than that it works: optim() is checkpointed just before the 8 hours are up, and then resumed successfully in a subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to checkpoint using the scheme Henrik suggests. Thanks all for the interesting conversation!

-Andy



On 12/14/21 5:39 PM, Henrik Bengtsson wrote:
> On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson <andy using yovo.org> wrote:
>>
>> Those are good points, Duncan. I am experimenting with a nice checkpointing tool called DMTCP. It operates on the system level but is quite OS-dependent. It can be found at http://dmtcp.sourceforge.net/index.html.
>>
>> Still, it would be nice to be able to checkpoint calls within R to potentially long-running processes like optim().
>
> Teasing idea. Imagine if we could come up with some de-facto standard
> API for this and that such a framework could be called automatically
> by R. Something similar to how user interrupts are checked (e.g.
> R_CheckUserInterrupt()) on a regular basis by the R engine and
> throughout the R code. That could help troubleshooting and debugging,
> e.g. sending the checkpoint to someone else or going backwards in
> time.
>
> Pasting in the below since I failed to hit Reply *All* the other day,
> and it was only Richard who got it:
>
> A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
> CheckPointing) for Linux (https://github.com/dmtcp/dmtcp).  I'm
> sharing in case someone is interested in investigating this further.
> Also, somewhere on the DMTCP wiki, they asked for testing with R by
> more experienced users.
>
> "DMTCP is a tool to transparently checkpoint the state of multiple
> simultaneous applications, including multi-threaded and distributed
> applications. It operates directly on the user binary executable,
> without any Linux kernel modules or other kernel modifications."
>
> They seem to be able to run this with HPC jobs, open files, Linux
> containers, and even MPI, and so on.  I've only tested it very quickly
> with interactive R and it seems to work.  Obviously more testing needs
> to be done to identify when it doesn't work.  For example, I'd have a
> hard time believing it would work out of the box with local parallel
> PSOCK workers.  They mention "plug-ins", so maybe there's a way to add
> support for specific use cases one by one.
>
> Several academic HPC environments appear to use it, e.g.
>
> * https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
> * http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
> * https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP
>
> That's all I have time for now,
>
> Henrik
>
>>
>> -Andy
>>
>> On 12/13/21 11:51 AM, Duncan Murdoch wrote:
>>> On 13/12/2021 12:58 p.m., Greg Minshall wrote:
>>>> Jeff,
>>>>
>>>>> This sounds like an OS feature, not an R feature... certainly not a
>>>>> portable R feature.
>>>>
>>>> i'm not arguing for it, but this seems to me like something that could
>>>> be a language feature.
>>>>
>>>
>>> R functions can call libraries written in other languages, and can start processes, etc.  R doesn't know everything going on in every function call, and would have a lot of trouble saving it.
>>>
>>> If you added some limitations, e.g. a process that periodically has its entire state stored in R variables, then it would be a lot easier.
>>>
>>> Duncan Murdoch
>>
>> --
>> Andy Jacobson
>> andy using yovo.org
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

--
Andy Jacobson
andy.jacobson using noaa.gov

NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305

303/497-4916

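
Duncan's suggestion quoted above (a process whose entire state is periodically stored in R variables) can be sketched for the optim() case as follows. This is only an illustration under stated assumptions, not code from anyone in the thread: checkpointed_optim(), its arguments, and the checkpoint file are hypothetical names. Note that restarting optim() from saved parameters discards the optimizer's internal state (for example the Nelder-Mead simplex), so the path taken will differ from that of a single uninterrupted run.

```r
# Hypothetical helper: run optim() in short bursts and save the best
# parameters found so far to disk after every burst.
checkpointed_optim <- function(par, fn, ckpt_file,
                               burst_maxit = 50, max_bursts = 20) {
  # Resume from an earlier run if a checkpoint file already exists;
  # otherwise start from the supplied parameters.
  state <- if (file.exists(ckpt_file)) {
    readRDS(ckpt_file)
  } else {
    list(par = par, value = fn(par), bursts = 0L, converged = FALSE)
  }

  while (!state$converged && state$bursts < max_bursts) {
    # Each burst is an ordinary optim() call (default method, Nelder-Mead)
    # that starts from the last saved parameters.
    fit <- optim(state$par, fn, control = list(maxit = burst_maxit))
    state <- list(par = fit$par, value = fit$value,
                  bursts = state$bursts + 1L,
                  converged = (fit$convergence == 0))

    # Write under a temporary name and then rename, so that a job killed
    # mid-write is less likely to leave a truncated checkpoint behind.
    tmp <- paste0(ckpt_file, ".tmp")
    saveRDS(state, tmp)
    file.rename(tmp, ckpt_file)
  }
  state
}

# Demo on a cheap test function. tempfile() is used for the demo only;
# a real batch job would need a path on persistent storage.
rosenbrock <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
ckpt <- tempfile(fileext = ".rds")

# First "job": stop after a single burst.
first <- checkpointed_optim(c(-1.2, 1), rosenbrock, ckpt, max_bursts = 1)

# Second "job": the checkpoint exists, so the starting values passed
# here are ignored and the run continues from the saved parameters.
resumed <- checkpointed_optim(c(-1.2, 1), rosenbrock, ckpt, max_bursts = 5)
```

Calling the function again with the same ckpt_file is what a follow-up batch job would do. Unlike DMTCP, this approach cannot interrupt a single evaluation of the cost function; it only saves between bursts, so burst_maxit would have to be chosen with the job time limit in mind.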


