Hack to automatically restart programs that stall during initialization
Fri 16 January 2015 by Dr. Dirk Colbry
Sometimes we get jobs that stall right at the beginning but do not error out until the walltime for the job has been exceeded. Users get an email saying their job "exceeds walltime", but when they check the output, nothing (or very little) seems to have happened. The cause of this problem depends heavily on what the job is doing. However, in some cases simply resubmitting the job gets it working. The following scripts check whether the program is making progress and automatically resubmit the job if there seems to be a problem.
file_flag_example.qsub
#!/bin/bash -l
#PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
#PBS -m a
#PBS -j oe
# This method checks to see if a file is created at the beginning of a run. If the file is not created then the job is killed and restarted.
# This method assumes that the program stalls before it has a chance to generate a file. This method is useful if the code does self checkpointing and the failure state is a busy wait.
cd ${PBS_O_WORKDIR}
#Run the main program
(
mpirun mycommand
) &
PID=$!
# Sleep for enough time to see if the job is running
sleep 300
if [ ! -f testfile.flag ]
then
echo "Job Seems to have stalled. Killing and restarting"
kill $PID
qsub $0
echo "Job stats for debugging"
qstat -f ${PBS_JOBID}
exit 1
fi
wait $PID
RET=$?
qstat -f ${PBS_JOBID}
#return the output of the main program
exit $RET
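The fixed `sleep 300` above always waits the full five minutes, even when the flag file shows up after a few seconds. A minimal sketch of a polling variant follows. The function name `wait_for_file`, its arguments, and the default 5 second poll interval are hypothetical choices for illustration and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: poll for a flag file instead of sleeping one fixed interval.
# Returns 0 as soon as the file exists, 1 if it is still missing after the timeout.
wait_for_file() {
    local file=$1            # file the program is expected to create
    local timeout=$2         # maximum number of seconds to wait
    local interval=${3:-5}   # seconds between checks (assumed default)
    local waited=0
    while [ "$waited" -lt "$timeout" ]; do
        [ -f "$file" ] && return 0
        sleep "$interval"
        waited=$((waited + interval))
    done
    # One last check after the final sleep
    [ -f "$file" ]
}
```

In the script above, the `sleep 300` and `if [ ! -f testfile.flag ]` pair could then be written as `if ! wait_for_file testfile.flag 300`, with the kill and resubmit logic left unchanged.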
output_monitor_example.qsub
#!/bin/bash -l
#PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
#PBS -m a
#PBS -j oe
# This method monitors job output and stops if the output doesn't change.
# This method assumes that the program continuously generates output at regular intervals.
cd ${PBS_O_WORKDIR}
testfile="testfile.flag"
#Run the main program
(
mpirun mycommand > $testfile
) &
PID=$!
# Sleep for enough time to start generating output
sleep 300
linecount1=`cat $testfile | wc -l`
# Sleep enough for more output
sleep 100
linecount2=`cat $testfile | wc -l`
if [ "$linecount1" == "$linecount2" ]
then
echo "Job Seems to have stalled. Killing and restarting"
kill $PID
qsub $0
echo "Job stats for debugging"
qstat -f ${PBS_JOBID}
exit 1
fi
wait $PID
RET=$?
qstat -f ${PBS_JOBID}
#return the output of the main program
exit $RET
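The two line counts above can be wrapped in a small function so the check is easy to repeat later in the run. This is only a sketch: `output_is_growing` is a hypothetical name, and treating a missing file as "not growing" is an assumption about what you want in that case.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if FILE gained lines during INTERVAL seconds,
# and 1 if the line count did not increase (or the file does not exist).
output_is_growing() {
    local file=$1
    local interval=$2
    local before after
    [ -f "$file" ] || return 1
    # $(( ... )) strips any padding that some wc implementations print
    before=$(( $(wc -l < "$file") ))
    sleep "$interval"
    after=$(( $(wc -l < "$file") ))
    [ "$after" -gt "$before" ]
}
```

For example, `if ! output_is_growing "$testfile" 100` could replace the two `wc -l` samples and the string comparison in the script above.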
qstat_monitor_example.qsub
#!/bin/bash -l
#PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
#PBS -m a
#PBS -j oe
# This method uses the same idea as the previous but instead of relying on output it uses the cput stat generated by qstat.
# This solution will not work if the job is in a busy wait state.
cd ${PBS_O_WORKDIR}
#Run the main program
(
mpirun mycommand
) &
PID=$!
# Sleep for enough time for the job to start accumulating CPU time
sleep 300
cpu1=`qstat -f $PBS_JOBID | grep resources_used.cput`
# Sleep long enough for the CPU time to increase
sleep 100
cpu2=`qstat -f $PBS_JOBID | grep resources_used.cput`
if [ "$cpu1" == "$cpu2" ]
then
echo "Job Seems to have stalled. Killing and restarting"
kill $PID
qsub $0
echo "Job stats for debugging"
qstat -f ${PBS_JOBID}
exit 1
fi
wait $PID
RET=$?
qstat -f ${PBS_JOBID}
#return the output of the main program
exit $RET
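The script above compares the raw `grep` output as strings. If you would rather compare numbers (for example, to require that CPU time grew by at least a few seconds), the value can be converted to seconds first. The sketch below assumes the `qstat -f` line looks like `resources_used.cput = HH:MM:SS`; check the output on your own system before relying on it. The function name `cput_seconds` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `qstat -f` text on stdin and print the cput value
# in seconds. Assumes a line of the form "resources_used.cput = HH:MM:SS".
# Prints nothing if no such line is found.
cput_seconds() {
    awk -F' = ' '/resources_used.cput/ {
        split($2, t, ":")
        print t[1] * 3600 + t[2] * 60 + t[3]
        exit
    }'
}
```

A possible use in the script above would be `cpu1=$(qstat -f "$PBS_JOBID" | cput_seconds)`, followed later by a numeric test such as `[ "$cpu2" -gt "$cpu1" ]`.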
These solutions are nice workarounds because, when they work, the scripts just restart your job until it runs and the research gets done. However, using this hack does not get at the root of the problem. There are actually two problems:
- Something is broken and causes the job to hang. This could be a race condition in the code, a bad node, bad file I/O, bad network connections, etc. It all depends on what the code is doing.
- The code hangs instead of quitting and reporting an error. Well-engineered code should not hang. For example, file and network access should have timeouts so that the code does not run forever.
Researchers should first notify the HPCC if they are using this hack so we can try to track down problems with the nodes. Researchers should also modify their code to report an error if something hangs. This will also help track down the problem.
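The best place for a timeout is inside the program itself, but for steps that are driven from the job script a similar effect is possible at the shell level. Here is a minimal sketch that assumes the GNU coreutils `timeout` command is available on the node; `run_with_timeout` is a hypothetical wrapper name. `timeout` exits with status 124 when the command runs past its limit, which the wrapper turns into an explicit error message.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command with a time limit and report an error
# instead of hanging. Relies on GNU coreutils `timeout`, which exits with
# status 124 when the time limit is reached.
run_with_timeout() {
    local limit=$1   # time limit in seconds
    local rc=0
    shift
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "ERROR: '$*' did not finish within ${limit} seconds" >&2
    fi
    return "$rc"
}
```

For example, `run_with_timeout 60 cp input.dat scratch/` (file names made up for illustration) would fail with a clear message after a minute instead of sitting on a stuck file system until the walltime runs out.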
- Dirk
Blogpost migrated from ICER Wiki using custom python script. Comment on errors below.