New Powertool to help checkpoint jobs
Thu 06 October 2011 by Dr. Dirk Colbry
Blog post edited by Xiaoge Wang
In a previous blog post I shared my script for automatically checkpointing jobs with BLCR, which lets us run jobs longer than a week:
http://wiki.hpcc.msu.edu/x/eIHT
I didn't like the complexity of that script, so I created a new Powertool to do the same thing. The tool is called "longjob", and it requires a few modifications to your submission script. In addition to running jobs longer than a week, using longjob with a four-hour walltime has the following advantages:
- Run jobs with unknown walltimes
- Run jobs on the buy-in nodes (which require a walltime of 4 hours or less)
- Make long jobs more robust against hardware failure
- Run jobs up to a maintenance window without having to wait for that window to complete
The following are instructions for trying out longjob on our system. First, start with a basic submission script. For example, consider the following simple submission script:
[Script listing not preserved in this copy of the post.]
To get longjob to work, the following modifications need to be made:
- Adjust the walltime to be shorter (I suggest 4 hours or less).
- Wrap all setup code that only needs to run once in an if statement that checks for the checkpoint file (checkfile.blcr). No checkpoint file exists the first time the script runs, so the setup code runs only on that first pass.
- Add the "longjob" command in front of the command in the submission script that you want to checkpoint.
- Load the powertools module and turn on aliases, i.e., add the following lines to the script:
- shopt -s expand_aliases
- module load powertools
- Set the following environment variables as appropriate for your job:
- BLCR_WAIT_SEC: number of seconds the job should wait before checkpointing and restarting (should be less than your walltime; default is 3 hours and 55 minutes).
- PBS_JOBSCRIPT (required): the path and name of the jobscript to use in the restart. Typically this is the same as your main jobscript, so by default you can always add the following line:
- export PBS_JOBSCRIPT="$0"
- BLCR_OUTPUT: name of the main standard output/standard error file (default is output.txt)
- BLCR_CHECKFILE: name of the checkpoint file (default is checkfile.blcr)
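As a rough illustration of the environment variables above, the snippet below sets each one explicitly and works out the default wait time in seconds. The values simply mirror the defaults listed above; `WALLTIME_SEC` is a hypothetical helper variable used only for the sanity check and is not something longjob reads.

```shell
#!/bin/bash
# Illustrative values only; adjust them for your own job.

# Hypothetical helper (not read by longjob): a 4-hour walltime in seconds.
WALLTIME_SEC=$(( 4 * 60 * 60 ))

# 3 hours 55 minutes expressed in seconds (3*3600 + 55*60 = 14100).
export BLCR_WAIT_SEC=$(( 3 * 60 * 60 + 55 * 60 ))

# Restart with the same jobscript that is currently running.
export PBS_JOBSCRIPT="$0"

# File names matching the defaults listed above.
export BLCR_OUTPUT="output.txt"
export BLCR_CHECKFILE="checkfile.blcr"

# The wait time has to be shorter than the walltime, otherwise the job
# would be killed before it gets a chance to checkpoint.
if [ "$BLCR_WAIT_SEC" -ge "$WALLTIME_SEC" ]; then
    echo "BLCR_WAIT_SEC must be smaller than the walltime" >&2
fi
echo "Checkpoint after ${BLCR_WAIT_SEC} s of a ${WALLTIME_SEC} s walltime"
```

With these numbers the job has 300 seconds between the checkpoint and the end of the walltime.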
The following is a modified example script with the changes:
[Script listing not preserved in this copy of the post.]
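To show how the changes listed above might fit together, here is a minimal sketch of a modified submission script. It is not the original listing from this post: the job name, the `results` directory, and `./my_program` are hypothetical placeholders, and the two `command -v` guards are there only so the sketch can also be run on a machine without the module system or longjob.

```shell
#!/bin/bash
#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn=1
#PBS -N longjob_sketch
#PBS -j oe

# Let aliases expand in this non-interactive script, then load powertools.
# The guard only keeps the sketch runnable where "module" is unavailable.
shopt -s expand_aliases
if command -v module >/dev/null 2>&1; then
    module load powertools
fi

# Required by longjob: the script to resubmit on restart.
export PBS_JOBSCRIPT="$0"
export BLCR_CHECKFILE="checkfile.blcr"

# Run from the submission directory (falls back to the current directory).
cd "${PBS_O_WORKDIR:-.}"

# One-time setup: skipped on restarts because the checkpoint file exists then.
if [ ! -f "$BLCR_CHECKFILE" ]; then
    mkdir -p results                          # hypothetical output directory
    echo "setup finished" > results/setup.log
fi

# Prefix the command to be checkpointed with longjob.
# ./my_program is a placeholder for your own executable.
if command -v longjob >/dev/null 2>&1; then
    longjob ./my_program
else
    echo "longjob not found; is the powertools module loaded?" >&2
fi
```

On the cluster you would normally drop the two guards and call `module load powertools` and `longjob ./my_program` directly.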
If everything works as expected, you should be able to qsub the above file, and it will resubmit itself every four hours until the job completes. Note that this is a work in progress and I have not tested all cases. For example, if the main program gets caught in a loop and never exits, the script will keep resubmitting itself indefinitely.
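One possible way to limit the runaway case is to count how many times the script has started and stop once a limit is reached. The sketch below is not part of longjob. It assumes that the longjob line is what triggers the resubmission, so exiting before that line breaks the cycle. `MAX_RUNS` and `COUNT_FILE` are hypothetical names chosen for this example.

```shell
#!/bin/bash
# Hypothetical guard; place it above the longjob line in the submission script.
MAX_RUNS=50                       # illustrative cap: 50 runs of 4 h is about 8 days
COUNT_FILE="longjob_runs.count"   # hypothetical counter file in the working directory

# Read the previous count, if any, and record this run.
runs=0
if [ -f "$COUNT_FILE" ]; then
    runs=$(cat "$COUNT_FILE")
fi
runs=$(( runs + 1 ))
echo "$runs" > "$COUNT_FILE"

# Stop before reaching longjob once the cap has been exceeded.
if [ "$runs" -gt "$MAX_RUNS" ]; then
    echo "Run $runs exceeds the cap of $MAX_RUNS; stopping here." >&2
    exit 1
fi
echo "Run $runs of at most $MAX_RUNS"
```

Delete the counter file before submitting a fresh job so the count starts again from zero.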
Please email me (colbrydi@msu.edu) if you end up using this code or if you would like to learn more about how longjob is implemented.
- Dirk
Blog post migrated from the ICER Wiki using a custom Python script.