Hack to automatically restart programs that stall during initialization

Fri 16 January 2015 by Dr. Dirk Colbry

Sometimes we get jobs that stall at the very beginning but do not error out until the job's walltime has been exceeded. Users get an email saying their job "exceeds walltime," but when they check the output, nothing (or very little) seems to have happened. The cause of this problem depends heavily on what the job is doing; in some cases, however, simply resubmitting the job gets it working. The following scripts check whether the program is making progress and automatically resubmit the job if there seems to be a problem.
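
Each of the scripts below resubmits itself with qsub $0, so submit the script file directly, for example:

    qsub file_flag_example.qsub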

file_flag_example.qsub

    #!/bin/bash -l
    #PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
    #PBS -m a
    #PBS -j oe

    # This method checks to see if a file is created at the beginning of a run. If the file is not created then the job is killed and restarted.
    # This method assumes that the program stalls before it has a chance to generate a file. This method is useful if the code does self checkpointing and the failure state is a busy wait.
    cd ${PBS_O_WORKDIR}

    # Remove any stale flag left over from a previous run so the
    # check below reflects this run's progress
    rm -f testfile.flag

    # Run the main program
    (
        mpirun mycommand
    ) &
    PID=$!

    # Sleep for enough time to see if the job is running
    sleep 300
    if [ ! -f testfile.flag ]
    then
        echo "Job seems to have stalled. Killing and restarting."
        kill $PID
        qsub $0
        echo "Job stats for debugging:"
        qstat -f ${PBS_JOBID}
        exit 1
    fi

    wait $PID
    RET=$?
    qstat -f ${PBS_JOBID}


    # Return the exit status of the main program
    exit $RET
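
The file check above assumes that mycommand creates testfile.flag shortly after it starts. If your program does not already write an early file, a thin wrapper can provide one. This is only a sketch; wrapper.sh and the flag name are illustrative:

    #!/bin/bash
    # wrapper.sh (illustrative): create the flag file once the program
    # has actually launched, then hand control to the real program.
    # Use "mpirun ./wrapper.sh" in place of "mpirun mycommand" above.
    touch testfile.flag
    exec mycommand "$@"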

output_monitor_example.qsub

    #!/bin/bash -l
    #PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
    #PBS -m a
    #PBS -j oe
    # This method monitors job output and stops if the output doesn't change. 
    # This method assumes that the program continuously generates output at regular intervals. 
    cd ${PBS_O_WORKDIR}

    # Output file used to monitor progress
    testfile="testfile.flag"

    # Run the main program
    (
        mpirun mycommand > $testfile
    ) &
    PID=$!

    # Sleep long enough for the program to start generating output
    sleep 300
    linecount1=`wc -l < $testfile`

    # Sleep long enough for more output to appear
    sleep 100
    linecount2=`wc -l < $testfile`
    if [ "$linecount1" = "$linecount2" ]
    then
        echo "Job seems to have stalled. Killing and restarting."
        kill $PID
        qsub $0
        echo "Job stats for debugging:"
        qstat -f ${PBS_JOBID}
        exit 1
    fi

    wait $PID
    RET=$?
    qstat -f ${PBS_JOBID}


    # Return the exit status of the main program
    exit $RET
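
Counting lines only works if the program writes line-oriented, unbuffered output. A variation, assuming GNU stat is available, compares the output file's modification time instead; this is a sketch, not a drop-in replacement:

    # Compare modification times (seconds since the epoch) instead of line counts
    mtime1=`stat --format=%Y $testfile`
    sleep 100
    mtime2=`stat --format=%Y $testfile`
    if [ "$mtime1" = "$mtime2" ]
    then
        echo "Output file has not changed; job appears stalled"
    fi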

qstat_monitor_example.qsub

    #!/bin/bash -l
    #PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
    #PBS -m a
    #PBS -j oe
    # This method uses the same idea as the previous one, but instead of relying on program output it watches the cput statistic reported by qstat.
    # It will not detect a busy-wait failure state, since a busy wait still accumulates CPU time.
    cd ${PBS_O_WORKDIR}

    #Run the main program
    (
        mpirun mycommand
    ) &
    PID=$!

    # Sleep long enough for the job to start accumulating CPU time
    sleep 300
    cpu1=`qstat -f $PBS_JOBID | grep resources_used.cput`

    # Sleep long enough for more CPU time to accumulate
    sleep 100
    cpu2=`qstat -f $PBS_JOBID | grep resources_used.cput`
    if [ "$cpu1" = "$cpu2" ]
    then
        echo "Job seems to have stalled. Killing and restarting."
        kill $PID
        qsub $0
        echo "Job stats for debugging:"
        qstat -f ${PBS_JOBID}
        exit 1
    fi

    wait $PID
    RET=$?
    qstat -f ${PBS_JOBID}


    # Return the exit status of the main program
    exit $RET
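
On Torque-style schedulers the grep above matches a line of the form "resources_used.cput = 00:05:23". If you prefer to compare just the time value, a small awk filter (a sketch, assuming that output format) will extract it:

    # Extract only the cput value, e.g. "00:05:23"
    cpu1=`qstat -f $PBS_JOBID | awk -F' = ' '/resources_used.cput/ {print $2}'`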

These scripts are nice workarounds because, when they work, they simply restart your job until it runs and the research gets done. However, this hack does not address the root of the problem. In fact, there are two problems:

  1. Something is broken that causes the job to hang. This could be a race condition in the code, a bad node, bad file I/O, a bad network connection, etc. It all depends on what the code is doing.
  2. The code hangs instead of quitting and reporting an error. Well-engineered code should not hang; for example, file and network access should have timeouts so that the code does not run forever (one simple defensive measure is sketched below).
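
One low-effort defense, assuming GNU coreutils' timeout command is available on the compute nodes, is to bound the run so that a hang becomes a reportable error instead of a silent walltime overrun:

    # timeout exits with status 124 if mycommand is still running
    # after 3.5 hours (just under this job's 03:59:59 walltime)
    timeout 3.5h mpirun mycommand
    if [ $? -eq 124 ]
    then
        echo "mycommand exceeded its time limit; treating as a stall" >&2
        exit 1
    fi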

Researchers should first notify the HPCC if they are using this hack so we can try to track down problems with the nodes. Researchers should also modify their code to report an error if something hangs; this will also help track down the problem.

  • Dirk
