[R] How to checkpoint-restart R jobs in batch mode?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Oct 14 16:29:05 CEST 2008
On Tue, 14 Oct 2008, Mizanur Khondoker wrote:
> Dear list,
> Most high performance computing clusters/grid engines have some
> restrictions on how long a job can be run in batch mode.
> The cluster I am using has a maximum limit of 48 hours, but my job would take
> far more than that.
> I know that it is possible to checkpoint jobs without modifying the code if
> some specialized software (e.g., BLCR) is installed on the grid engine.
> However, I am looking for a solution for when this kind of facility is not
> available on the cluster, for example, by modifying the code so that the
> job can checkpoint and restart by itself.
> Does anyone have any experience or idea of doing so? Any help would be
> greatly appreciated.
Yes, we've done this for many years, generally by saving the workspace
every few hours (in our case say every 100 simulation runs), and making
sure that the workspace contains enough information to restart at the save
points. This approach does depend on the run coming back to an easily
reproducible point fairly often: if it is a simulation running entirely in
C++ code in a package, you have little hope.
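
A minimal sketch of the save-and-restart pattern described above, in R.
It is an illustration under stated assumptions, not the code we actually
use: the function name run_with_checkpoints(), its arguments, and the
state layout are all hypothetical, and it saves a small state list with
saveRDS() rather than the whole workspace. The stop_after argument exists
only to imitate a job being killed at the time limit.

```r
# Hypothetical helper: run `one_run(i)` for i = 1..n_runs, saving enough
# state every `save_every` runs to resume after the job is killed.
run_with_checkpoints <- function(n_runs, save_every, checkpoint_file, one_run,
                                 stop_after = n_runs) {
  if (file.exists(checkpoint_file)) {
    # Restart: recover results so far, the next run index, and the RNG state.
    state <- readRDS(checkpoint_file)
    assign(".Random.seed", state$seed, envir = globalenv())
  } else {
    # Fresh start (the seed value 1 is an arbitrary choice for this sketch).
    set.seed(1)
    state <- list(results = numeric(n_runs), next_run = 1L, seed = NULL)
  }

  i <- state$next_run
  while (i <= min(n_runs, stop_after)) {
    state$results[i] <- one_run(i)
    i <- i + 1L

    # Checkpoint after every `save_every` completed runs, and at the very end.
    if ((i - 1L) %% save_every == 0L || i > n_runs) {
      state$next_run <- i
      state$seed <- get(".Random.seed", envir = globalenv())
      # Write to a temporary file first, so that a kill during the write
      # does not leave a truncated checkpoint behind.
      tmp <- paste0(checkpoint_file, ".tmp")
      saveRDS(state, tmp)
      file.rename(tmp, checkpoint_file)
    }
  }
  state$results
}

# Example call in a batch script (names are placeholders):
# res <- run_with_checkpoints(1000L, 100L, "checkpoint.rds",
#                             function(i) mean(rnorm(50)))
```

The batch script is simply resubmitted until it finishes; each submission
picks up from the last save point. Storing the random number state along
with the results is what makes the restart point reproducible: runs done
after the last save are repeated with the same random numbers.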
> Mizanur Khondoker
> Division of Pathway Medicine (DPM)
> The University of Edinburgh Medical School
> The Chancellor's Building
> 49 Little France Crescent
> Edinburgh EH16 4SB
> United Kingdom
> Tel: +44 (0) 131 242 6287
> Fax: +44 (0) 131 242 6244
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595