Running jobs longer than one week using BLCR

Thu 21 April 2011 by Dr. Dirk Colbry

Blog post edited by Anonymous - "Migration of unmigrated content due to installation of a new plugin"

Our submission system is set up with a maximum walltime of one week. This works fine for most users but sometimes it is nice to be able to run a job even longer. The following script uses Berkley Lab Checkpoint Restart (BLCR) to automatically save the current state of a job and submit it back to the scheduler.

long_job.qsub

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
#!/bin/sh -login
#PBS -j oe
#PBS -l nodes=1:ppn=1,walltime=24:10:00,mem=2gb
#PBS -m a

cd ${PBS_O_WORKDIR}

#Berkly Lab Checkpoint Restart script to run a job continuously
#Written by Dirk Colbry

#Job restarts itself every 24 hours or 86400 seconds
export walltime="86400"

export output="output.txt"

# Name if main checkpoint file
export checkpoint="checkfile.blcr"

if [ "${PBS_ARRAYID}" = "" ]
then
    echo "Running for the first time"
    #### SET UP JOB,
        #Runs once.  Include any job setup commands inside this if block before the cr_run command.

        #Replace the program "supernova 1000" with your program and input arguments
    cr_run ./supernova 100 1> ${output} 2>&1 &
    export PID=$!
    export next=1
else
    echo "Restarting ${PBS_ARRAYID}"
        #Job running as a restart job
    cr_restart --no-restore-pid ${checkpoint} >> ${output} 2>&1 &
    export PID=$!
    export next=$(($PBS_ARRAYID+1))
fi

#function to run if the program times out
checkpoint_timeout() {
  echo "Timeout. Checkpointing Job"

  time cr_checkpoint --term ${PID}
  echo "**********"
  tail ${output}
  echo ""
  echo "**********"

  if [ ! "$?" == "0" ]
  then
        echo "Failed to checkpoint"
        exit 2
  fi

  echo "Queueing Next Job"
  chmod 644 context.${PID}
  mv context.${PID} ${checkpoint}
  qsub -t ${next} long_job.qsub

  exit 0
}

#set checkpoint timeout
(sleep ${walltime}; echo 'Timer Done'; checkpoint_timeout;) &
timeout=$!
echo "starting timer (${timeout}) for ${walltime} seconds"

echo "Waiting on $PID"
wait ${PID}
RET=$?

#Check to see if job checkpointed
if [ "${RET}" = "143" ] #Job terminated due to cr_checkpoint
then
    echo "Job seems to have been checkpointed, waiting for checkpoint to complete."
    wait ${timeout}
    qstat -f ${PBS_JOBID}
    exit 0
fi

## JOB completed

#Kill timeout timer
kill ${timeout}

#Output the job statistics
qstat -f ${PBS_JOBID}

#Email the user that the job has completed
qstat -f ${PBS_JOBID} | mail -s "JOB COMPLETE" $USER@msu.edu
echo "Job completed with exit status ${RET}"
exit 254

The job will keep submitting itself until the code exits successfully.

Some limitations of this script include:

  • This one uses Job array as an iteration flag so the script will not work in a job array
  • Currently only been tested on single thread jobs.

A simple modification to this script could be made such that each job runs less than 4 hours. This would allow them to run on the buy-in nodes.

Hope you find this useful,

  • Dirk

View Online

Blogpost migrated from ICER Wiki using custom python script. Comment on errors below.


Comments