Our submission system is set up with a maximum walltime of one week. This
works fine for most users, but sometimes it is useful to run a job for
even longer. The following script uses Berkeley Lab Checkpoint/Restart
(BLCR) to automatically save the current state of a job and resubmit it
to the scheduler.
#!/bin/sh -login
#PBS -j oe
#PBS -l nodes=1:ppn=1,walltime=24:10:00,mem=2gb
#PBS -m a

cd ${PBS_O_WORKDIR}

# Berkeley Lab Checkpoint/Restart script to run a job continuously
# Written by Dirk Colbry

# Job restarts itself every 24 hours or 86400 seconds
export walltime="86400"
export output="output.txt"

# Name of main checkpoint file
export checkpoint="checkfile.blcr"

if [ "${PBS_ARRAYID}" = "" ]
then
    echo "Running for the first time"
    #### SET UP JOB
    # Runs once. Include any job setup commands inside this if block
    # before the cr_run command.
    # Replace the program "./supernova 1001" with your program and
    # input arguments
    cr_run ./supernova 1001 > ${output} 2>&1 &
    export PID=$!
    export next=1
else
    echo "Restarting ${PBS_ARRAYID}"
    # Job running as a restart job
    cr_restart --no-restore-pid ${checkpoint} >> ${output} 2>&1 &
    export PID=$!
    export next=$(($PBS_ARRAYID + 1))
fi

# Function to run if the program times out
checkpoint_timeout() {
    echo "Timeout. Checkpointing Job"
    time cr_checkpoint --term ${PID}
    # Save the checkpoint status now; the echo and tail commands below
    # would otherwise overwrite $?
    ckpt_status=$?
    echo "**********"
    tail ${output}
    echo ""
    echo "**********"
    if [ "${ckpt_status}" -ne 0 ]
    then
        echo "Failed to checkpoint"
        exit 2
    fi
    echo "Queueing Next Job"
    chmod 644 context.${PID}
    mv context.${PID} ${checkpoint}
    qsub -t ${next} long_job.qsub
    exit 0
}

# Set checkpoint timeout
(sleep ${walltime}; echo 'Timer Done'; checkpoint_timeout;) &
timeout=$!
echo "starting timer (${timeout}) for ${walltime} seconds"

echo "Waiting on $PID"
wait ${PID}
RET=$?

# Check to see if job checkpointed
if [ "${RET}" = "143" ]  # Job terminated due to cr_checkpoint
then
    echo "Job seems to have been checkpointed, waiting for checkpoint to complete."
    wait ${timeout}
    qstat -f ${PBS_JOBID}
    exit 0
fi

## JOB completed

# Kill timeout timer
kill ${timeout}

# Output the job statistics
qstat -f ${PBS_JOBID}

# Email the user that the job has completed
qstat -f ${PBS_JOBID} | mail -s "JOB COMPLETE" $USER@msu.edu
echo "Job completed with exit status ${RET}"
exit 254
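
The test for exit status 143 relies on a common shell convention: a process killed by a signal is reported as 128 plus the signal number, and SIGTERM is signal 15 on Linux. The sketch below is only an illustration of that convention, not part of the job script. It uses sleep as a stand-in for the checkpointed program and sends the signal by hand, where the real script lets cr_checkpoint --term do it.

```shell
#!/bin/sh
# Stand-in for the long-running program (assumption: sleep replaces ./supernova)
sleep 30 &
pid=$!

# Terminate it the way cr_checkpoint --term would after saving the checkpoint
kill -TERM "$pid"

# wait returns the child's status; a signal death shows up as 128 + signal number
wait "$pid"
status=$?

echo "exit status: $status"
if [ "$status" -eq $((128 + 15)) ]
then
    echo "terminated by SIGTERM"
fi
```

Any other status means the program ended on its own, which is why the script treats everything except 143 as job completion.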
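Save the script as long_job.qsub, the name it uses when resubmitting itself.
The job will keep resubmitting itself until the program exits on its own
before the timer expires.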
Some limitations of this script include:
It uses the job array ID as an iteration counter, so the script cannot itself be run as a job array.
It has only been tested on single-threaded jobs.
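
One possible way around the job array limitation, offered only as an untested sketch, is to carry the iteration counter in an environment variable passed with qsub -v instead of in the array ID. The variable name RESTART_COUNT and the helper next_count are hypothetical and do not appear in the original script. The snippet only prints the qsub command it would run.

```shell
#!/bin/sh
# RESTART_COUNT is a hypothetical variable name; it would be handed to the
# next job with qsub -v rather than encoded in the job array ID.
next_count() {
    # Empty argument means this is the first run, so the next run is number 1
    if [ -z "$1" ]
    then
        echo 1
    else
        echo $(($1 + 1))
    fi
}

next=$(next_count "${RESTART_COUNT}")

# Print the resubmission command instead of running it
echo "would run: qsub -v RESTART_COUNT=${next} long_job.qsub"
```

The first-run test in the script would then check RESTART_COUNT rather than PBS_ARRAYID.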
A simple modification to this script would make each job run for less than
4 hours: shorten both the requested PBS walltime and the walltime timer
variable. This would allow the jobs to run on the buy-in nodes.
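
The original script requests 24:10:00 from PBS for an 86400 second timer, which leaves 10 minutes for the checkpoint to be written. The sketch below keeps that margin and works out a request that stays under 4 hours. The 3.5 hour timer and the helper name to_hms are assumptions chosen for illustration; the right values depend on how long your checkpoint takes.

```shell
#!/bin/sh
# Assumed values: a 3.5 hour timer plus the same 10 minute margin
# that the original script leaves for writing the checkpoint.
timer=12600
margin=600

# Hypothetical helper: convert seconds to the HH:MM:SS form used by #PBS -l walltime
to_hms() {
    printf '%02d:%02d:%02d\n' $(($1 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

request=$(to_hms $((timer + margin)))

echo "export walltime=\"${timer}\""
echo "#PBS -l nodes=1:ppn=1,walltime=${request},mem=2gb"
```

With these numbers the two lines to change in the script become a timer of 12600 seconds and a request of 03:40:00.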