Restart Stalled Programs
Tue 28 October 2014 by Dr. Dirk Colbry
Page edited by Camille Archer - "Migration of unmigrated content due to installation of a new plugin"
Hack to automatically restart programs that stall during initialization
Unknown User (colbrydi@msu.edu) posted on Jan 16, 2015
Sometimes we get jobs that stall right at the beginning but do not error out until the walltime for the job has been exceeded. Users get an email saying their job "exceeds walltime", but when they check the output, nothing (or very little) seems to have happened. The cause of this problem depends heavily on what the job is doing. However, in some cases simply resubmitting the job gets it working. The following scripts check whether the program is making progress and automatically resubmit the job if there seems to be a problem.
file_flag_example.qsub
) & PID=$!
# Sleep long enough for the program to create its flag file
sleep 300
if [ ! -f testfile.flag ]
then
    echo "Job seems to have stalled. Killing and restarting"
    kill "$PID"
    qsub "$0"
    echo "Job stats for debugging"
    qstat -f "${PBS_JOBID}"
    exit 1
fi
wait "$PID"
RET=$?
qstat -f "${PBS_JOBID}"
# Return the exit status of the main program
exit $RET
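The flag-file check above can also be packaged as a reusable function. This is only a minimal sketch under assumptions, not the original script: the function name `run_with_flag_check`, its arguments, and the return code 75 for "killed as stalled" are all hypothetical choices made for illustration.

```shell
# Hypothetical helper (not part of the original script).
# Run a command in the background and check that it creates a flag file
# within a grace period.
# Usage: run_with_flag_check FLAGFILE GRACE_SECONDS COMMAND [ARGS...]
# Returns the command's exit status, or 75 if it was killed as stalled.
run_with_flag_check() {
    local flag=$1 grace=$2
    shift 2
    rm -f "$flag"        # a stale flag from an earlier run would hide a stall
    "$@" &               # start the real work in the background
    local pid=$!
    sleep "$grace"       # give the program time to create the flag
    if [ ! -f "$flag" ]; then
        echo "Job seems to have stalled. Killing it." >&2
        kill "$pid" 2>/dev/null
        wait "$pid" 2>/dev/null
        return 75        # assumed convention: 75 means "stalled, please retry"
    fi
    wait "$pid"          # flag appeared: wait and pass on the exit status
}
```

In a job script this might be called as `run_with_flag_check testfile.flag 300 ./my_program`, where `./my_program` is a placeholder for a program that creates `testfile.flag` once it has started, followed by `qsub $0` when the return code is 75.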
output_monitor_example.qsub
) & PID=$!
# Sleep long enough for the program to start generating output
sleep 300
linecount1=$(wc -l < "$testfile")
# Sleep long enough for more output to appear
sleep 100
linecount2=$(wc -l < "$testfile")
if [ "$linecount1" = "$linecount2" ]
then
    echo "Job seems to have stalled. Killing and restarting"
    kill "$PID"
    qsub "$0"
    echo "Job stats for debugging"
    qstat -f "${PBS_JOBID}"
    exit 1
fi
wait "$PID"
RET=$?
qstat -f "${PBS_JOBID}"
# Return the exit status of the main program
exit $RET
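The line-count comparison above treats "no new lines" as a stall. A small sketch of that test as a standalone function is shown here. The names `count_lines` and `output_is_growing` are hypothetical, and the sketch assumes a missing output file should count as zero lines rather than as an error.

```shell
# Hypothetical helpers (not part of the original script).

# Print the number of lines in a file, or 0 if it does not exist yet.
count_lines() {
    if [ -f "$1" ]; then
        echo $(( $(wc -l < "$1") ))
    else
        echo 0
    fi
}

# Succeed (status 0) only if FILE gains at least one line in INTERVAL seconds.
# Usage: output_is_growing FILE INTERVAL
output_is_growing() {
    local file=$1 interval=$2 before after
    before=$(count_lines "$file")
    sleep "$interval"
    after=$(count_lines "$file")
    [ "$after" -gt "$before" ]
}
```

A job script could then use `if ! output_is_growing "$testfile" 100; then ... fi` in place of the two explicit `wc -l` calls. Comparing numerically instead of as strings also avoids surprises from whitespace in the `wc` output on some systems.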
qstat_monitor_example.qsub
) & PID=$!
# Sleep long enough for the program to start using CPU time
sleep 300
cpu1=$(qstat -f "$PBS_JOBID" | grep resources_used.cput)
# Sleep long enough for more CPU time to accumulate
sleep 100
cpu2=$(qstat -f "$PBS_JOBID" | grep resources_used.cput)
if [ "$cpu1" = "$cpu2" ]
then
    echo "Job seems to have stalled. Killing and restarting"
    kill "$PID"
    qsub "$0"
    echo "Job stats for debugging"
    qstat -f "${PBS_JOBID}"
    exit 1
fi
wait "$PID"
RET=$?
qstat -f "${PBS_JOBID}"
# Return the exit status of the main program
exit $RET
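The script above compares the raw `resources_used.cput` lines as strings. If a numeric comparison is preferred (for example, to require some minimum amount of CPU progress), the value can be converted to seconds first. The following is a sketch only: the function name `cput_seconds` is hypothetical, and it assumes the `qstat -f` output contains a line of the form `resources_used.cput = HH:MM:SS`.

```shell
# Hypothetical helper (not part of the original script).
# Read `qstat -f` style text on stdin and print resources_used.cput in seconds.
# Prints 0 if no such line is present (assumed to mean "no usage reported yet").
cput_seconds() {
    awk -F' = ' '
        /resources_used\.cput/ {
            split($2, t, ":")                       # t[1]=hours, t[2]=min, t[3]=sec
            print t[1] * 3600 + t[2] * 60 + t[3]
            found = 1
            exit
        }
        END { if (!found) print 0 }
    '
}
```

In the job script this could be used as `cpu1=$(qstat -f "$PBS_JOBID" | cput_seconds)`, with the stall test becoming `[ "$cpu2" -le "$cpu1" ]`.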
These solutions are nice workarounds because, when they work, the scripts simply restart your job until it runs and the research gets done. However, using this hack does not get at the root of the problem. There are actually two problems:
- Something is broken, causing the job to hang. This could be a race condition in the code, a bad node, bad file I/O, a bad network connection, etc. It all depends on what the code is doing.
- The code hangs instead of quitting and reporting an error. Well-engineered code should not hang. For example, file and network access should have timeouts so that the code does not run forever.
Researchers should first notify the HPCC if they are using this hack so we can try to track down problems with the nodes. Researchers should also work to modify their code to report an error if something hangs. This will also help track down the problem.
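As one illustration of the timeout advice above, a step that might hang can be wrapped in the `timeout` command from GNU coreutils, which exits with status 124 when the time limit is reached. This is a minimal sketch, assuming `timeout` is available on the system; the wrapper name `run_with_timeout` is hypothetical.

```shell
# Hypothetical wrapper (not from the original post).
# Run a command with a time limit and print a clear error if it is exceeded.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
run_with_timeout() {
    local limit=$1 rc=0
    shift
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with 124 when the command ran out of time
        echo "ERROR: '$*' did not finish within ${limit}s" >&2
    fi
    return "$rc"
}
```

For example, `run_with_timeout 60 cp input.dat "$TMPDIR"/` (file names here are placeholders) would fail with an error message after one minute instead of silently using up the job's walltime.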
- Dirk
2014-12-16 HPCC workshop slides and handouts
Unknown User (colbrydi@msu.edu) posted on Dec 15, 2014
I will be teaching my bi-annual Introductory and Advanced HPCC workshops tomorrow. Below are links to my updated slides and handouts. Registration looks light, so feel free to drop in if you have the time. These workshops are being provided as part of IT Services' two-day offering of no-charge seminars to faculty and graduate students on technology topics on December 16 and 17. More information and registration are available at the following website:
http://tech.msu.edu/events/tech-seminars/
Introduction to the HPCC: A Hands-On Introduction to High Performance Computing at MSU
- Tuesday, December 16th 8:30am-11:30am in 403 of the Computer Center
- During this workshop, participants will learn through hands-on examples how to get started with the MSU High Performance Computing Center (HPCC). Topics include: connecting to the HPCC; copying files to your home directories; navigating the command line interface; accessing available software; testing and running programs interactively; editing files; writing job scripts; submitting jobs to the queue; and monitoring jobs running on the system.
- This is an interactive workshop held in a computer lab, although participants are encouraged to bring their research laptops if applicable. Multiple instructors will be available to help you get started and diagnose problems.
12:00-1:15pm Lunch is provided in the MSU Ballroom (registered participants only)
Making Your Research Go Faster: Advanced Topics in Getting the Most Out of the MSU HPCC
- Tuesday, December 16th 1:30pm-4:30pm in 403 of the Computer Center
- During this workshop, participants already familiar with using the HPCC systems will be shown advanced techniques on how to use the system more effectively. Topics include: techniques for effective scheduling of pleasantly parallel jobs; shared memory jobs and shared network jobs; running jobs longer than a week; and making jobs fault tolerant.
- This is an interactive workshop held in a computer lab, although participants are encouraged to bring their research laptops if applicable. Multiple instructors will be available to help get you started and diagnose problems.
SLIDES:
HANDOUTS:
- 2014-12-16 [HPCC_WORKSHOP_COMMANDS.pdf](./images/HPCC_WORKSHOP_COMMANDS.pdf)
- HPCC_Command_Summary.pdf
- dirk
2014-12-05 Western Michigan University, Introduction to iCER slides
Unknown User (colbrydi@msu.edu) posted on Dec 04, 2014
Here is a copy of my slides and the handout for our two-hour introductory talk at Western Michigan University:
Unknown User (colbrydi@msu.edu) posted on Nov 16, 2014
We do not directly support zsh users on our system. However, many of our more advanced users enjoy some of the modern features provided by zsh. One of these users shared a code snippet that he uses in his ~/.zshrc file to autocomplete job ID numbers. He is letting us share the code with our users who prefer zsh over bash.
.zshrc
_jobs_list() {
  qstat | grep "$USER" | cut -d' ' -f1 | cut -d'.' -f1
}
_jshow_complete() {
  if (( CURRENT == 2 )); then
    jjobs=( $(_jobs_list) )
    _multi_parts / jjobs
  else
    _files
  fi
}
compdef _jshow_complete showstart jdel checkjob qstat
This code should work on any system that uses PBS Torque.
I hope you find it useful,
2014-10-23 Advanced High Performance Computing
Unknown User (colbrydi@msu.edu) posted on Oct 23, 2014
Here are the slides for the Advanced HPC class:
2014-10-24_CI-Days Advanced HPCC.pdf
And here is the handout:
2014-10-24_CI-Days Advanced HPCC Handout.pdf
Blogpost migrated from ICER Wiki using custom python script. Comment on errors below.