Blog post edited by Anonymous - "Migration of unmigrated content due to installation of a new plugin"

Sometimes we get jobs that stall out right at the beginning but do not error out until the walltime for the job has been exceeded. Users get an email saying their job "exceeds walltime", but when they check the output, nothing (or very little) seems to have happened. The cause of this problem is highly dependent on what the job is doing. However, in some cases simply resubmitting the job gets it working. The following scripts check whether the program is running and automatically resubmit the job if there seems to be a problem.

file_flag_example.qsub

    #!/bin/bash -l
    #PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
    #PBS -m a
    #PBS -j oe

    # This method checks to see if a file is created at the beginning of a run. If the file is not created then the job is killed and restarted.
    # This method assumes that the program stalls before it has a chance to generate a file. This method is useful if the code does self checkpointing and the failure state is a busy wait.
    cd ${PBS_O_WORKDIR}
     
    # Run the main program in the background
    (
        mpirun mycommand
    ) &
    PID=$!

    # Sleep long enough for the program to create its flag file
    sleep 300
    if [ ! -f testfile.flag ]
    then
        echo "Job seems to have stalled. Killing and restarting"
        kill "$PID"
        qsub "$0"
        echo "Job stats for debugging"
        qstat -f "${PBS_JOBID}"
        exit 1
    fi

    wait $PID
    RET=$?
    qstat -f ${PBS_JOBID}


    #return the output of the main program
    exit $RET

output_monitor_example.qsub

    #!/bin/bash -l
    #PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
    #PBS -m a
    #PBS -j oe
    # This method monitors job output and stops if the output doesn't change. 
    # This method assumes that the program continuously generates output at regular intervals. 
    cd ${PBS_O_WORKDIR}

    # Name of the file that captures the program's output
    testfile=testfile.flag

    # Run the main program in the background, capturing its output
    (
        mpirun mycommand > "$testfile"
    ) &
    PID=$!

    # Sleep for enough time to start generating output
    sleep 300
    linecount1=$(wc -l < "$testfile")

    # Sleep enough for more output
    sleep 100
    linecount2=$(wc -l < "$testfile")
    if [ "$linecount1" = "$linecount2" ]
    then
        echo "Job seems to have stalled. Killing and restarting"
        kill "$PID"
        qsub "$0"
        echo "Job stats for debugging"
        qstat -f "${PBS_JOBID}"
        exit 1
    fi

    wait $PID
    RET=$?
    qstat -f ${PBS_JOBID}


    #return the output of the main program
    exit $RET
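
The line-count comparison in this script can also be wrapped in a small helper so the same check can be reused at several points in a run. The following is a minimal sketch, not part of the original scripts: `file_grew` is a hypothetical helper name, and the temporary file, the stand-in writer loop, and the short sleeps exist only to keep the demonstration quick.

```shell
#!/bin/bash
# file_grew FILE SECONDS  (hypothetical helper, not from the original scripts)
# Succeeds if FILE gains at least one line while we wait SECONDS.
file_grew() {
    local before after
    before=$(wc -l < "$1")
    sleep "$2"
    after=$(wc -l < "$1")
    [ "$after" -gt "$before" ]
}

# Demonstration with a stand-in "program" that writes one line per second.
demo=$(mktemp)
( for i in 1 2 3; do echo "step $i"; sleep 1; done >> "$demo" ) &
writer=$!

if file_grew "$demo" 2; then
    echo "output is growing"
else
    echo "output stalled"
fi

# Once the writer has finished, the file no longer changes.
wait "$writer"
if file_grew "$demo" 1; then
    echo "output is growing"
else
    echo "output stalled"
fi
rm -f "$demo"
```

In a job script, a call such as `file_grew "$testfile" 100` would stand in for the two `wc -l` measurements and the comparison between them.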

qstat_monitor_example.qsub

    #!/bin/bash -l
    #PBS -l mem=100gb,nodes=100:ppn=1,walltime=03:59:59
    #PBS -m a
    #PBS -j oe
    # This method uses the same idea as the previous one, but instead of relying on output it uses the cput statistic reported by qstat.
    # This solution will not work if the job is in a busy-wait state, because a busy wait still accumulates CPU time.
    cd "${PBS_O_WORKDIR}"

    # Run the main program in the background
    (
        mpirun mycommand
    ) &
    PID=$!

    # Sleep long enough for the program to start using CPU time
    sleep 300
    cpu1=$(qstat -f "$PBS_JOBID" | grep resources_used.cput)

    # Sleep long enough for the reported CPU time to change
    sleep 100
    cpu2=$(qstat -f "$PBS_JOBID" | grep resources_used.cput)
    if [ "$cpu1" = "$cpu2" ]
    then
        echo "Job seems to have stalled. Killing and restarting"
        kill "$PID"
        qsub "$0"
        echo "Job stats for debugging"
        qstat -f "${PBS_JOBID}"
        exit 1
    fi

    wait $PID
    RET=$?
    qstat -f ${PBS_JOBID}


    #return the output of the main program
    exit $RET
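
All three scripts resubmit unconditionally, so a job that stalls every time would keep restarting itself. One possible guard is to carry a restart counter from one submission to the next. The sketch below is an assumption rather than part of the original scripts: `RESTART_COUNT`, `MAX_RESTARTS`, and `can_restart` are hypothetical names, and the `qsub` line is left as a comment so the snippet runs outside the batch system.

```shell
#!/bin/bash
# Hypothetical variables: neither name is defined by PBS itself.
MAX_RESTARTS=${MAX_RESTARTS:-3}
RESTART_COUNT=${RESTART_COUNT:-0}

# can_restart COUNT LIMIT  (hypothetical helper)
# Succeeds while another resubmission is still allowed.
can_restart() {
    [ "$1" -lt "$2" ]
}

if can_restart "$RESTART_COUNT" "$MAX_RESTARTS"; then
    echo "restart $((RESTART_COUNT + 1)) of $MAX_RESTARTS"
    # In a job script, the resubmission would pass the counter along, e.g.:
    # qsub -v RESTART_COUNT=$((RESTART_COUNT + 1)) "$0"
else
    echo "giving up after $RESTART_COUNT restarts"
fi
```

The counter travels through the `-v` option of `qsub`, which exports the listed variables into the environment of the new job; the first submission simply starts from zero.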

These solutions are nice workarounds because, when they work, the scripts just restart your job until it runs and gets the research done. However, using this hack does not get at the root of the problem. Actually, there are two problems:

  1. Something is broken, causing the job to hang. This could be a race condition in the code, a bad node, bad file I/O, bad network connections, etc. It all depends on what the code is doing.
  2. The code hangs instead of quitting and reporting an error. Well-engineered code should not hang. For example, file and network access should have timeouts so that the code does not run forever.
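
When the program itself cannot easily be changed, a time limit can sometimes be imposed from the job script instead. The following is a rough sketch under the assumption that the GNU coreutils `timeout` command is available on the node; `run_with_timeout` is a hypothetical helper name, and `sleep 5` stands in for a step that hangs.

```shell
#!/bin/bash
# run_with_timeout SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND under GNU timeout and reports an error if the limit is hit.
run_with_timeout() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local status=$?
    # GNU timeout exits with status 124 when the time limit expires.
    if [ "$status" -eq 124 ]; then
        echo "ERROR: '$*' did not finish within ${limit}s" >&2
    fi
    return "$status"
}

# Stand-in for a step that hangs: a 5-second sleep with a 1-second limit.
run_with_timeout 1 sleep 5
echo "exit status: $?"
```

This turns a silent hang into a non-zero exit status and an error message, which is the behavior asked for above. It is no substitute for timeouts inside the code's own file and network calls.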

Researchers should first notify the HPCC if they are using this hack so we can try to track down problems with the nodes. Researchers should also work to modify their code to report an error if something hangs. This will also help track down the problem.

  • Dirk

Blogpost migrated from ICER Wiki using custom python script. Comment on errors below.


2014-12-16 HPCC workshop slides and handouts

Mon 15 December 2014 by Dr. Dirk Colbry

Blog post edited by Anonymous

I will be teaching my bi-annual Introductory and Advanced HPCC workshops tomorrow. Below are links to my updated slides and handouts. Registration looks light, so feel free to drop in if you have the time. These workshops are being provided as part of IT Services …

2014-12-05 Western Michigan University, Introduction to iCER slides

Thu 04 December 2014 by Dr. Dirk Colbry

Blog post edited by Anonymous

Here is a copy of my slides and the handout for our two-hour introductory talk at Western Michigan University:

View Online

zsh job number autocomplete

Sun 16 November 2014 by Dr. Dirk Colbry

Blog post edited by Anonymous - "Migration of unmigrated content due to installation of a new plugin"

We do not directly support zsh users on our system. However, many of our more advanced users enjoy some of the modern and advanced features provided by zsh. One of these users shared a …

2014-05-07: Workshop on Managing, Sharing and Moving Big Data

Tue 28 October 2014 by Dr. Dirk Colbry

Blog post edited by Camille Archer

This is a new workshop being provided on May 7 as part of IT Services' two-day offering of no-charge seminars on technology topics for faculty and graduate students. More information and registration are available at the following website:

http://tech.msu.edu …

Restart Stalled Programs

Tue 28 October 2014 by Dr. Dirk Colbry

Page edited by Camille Archer - "Migration of unmigrated content due to installation of a new plugin"

Hack to automatically restart programs that stall during initialization

Unknown User (colbrydi@msu.edu) posted on Jan 16, 2015

Sometimes we get jobs that stall out right at the beginning but do not error …

2014-10-23 Advanced High Performance Computing

Thu 23 October 2014 by Dr. Dirk Colbry

Blog post edited by Anonymous

Here are the slides for the Advanced HPC class:

2014-10-24_CI-Days Advanced HPCC.pdf

And here is the handout:

2014-10-24_CI-Days Advanced HPCC Handout.pdf

  • Dirk

2014-08-20: EDAMAME Workshop at Kellogg Biological Center

Mon 20 October 2014 by Dr. Dirk Colbry

Blog post edited by Camille Archer

Attached are copies of the slides I am planning to present at the EDAMAME workshop. Information about the class can be found here:

http://edamame-course.org/

This presentation is a little different from my previous ones since it is more for researchers outside of MSU and …

CSE 891 Section 1: Parallel Computing: Fundamentals and Applications

Wed 10 September 2014 by Dr. Dirk Colbry

Blog post added by Anonymous

I was asked to give a talk about the HPCC and my research. Here are the slides if anyone is interested:

2014-09-10 Parallel Programming Class.pdf

  • Dirk

2014-2015 New Faculty Orientation

Mon 25 August 2014 by Dr. Dirk Colbry

Blog post added by Anonymous

Here are the slides for introducing new faculty to iCER. Thanks to Ben Ong for editing and updating these slides for me.

2014-08-25-faculty_orientation.pdf

  • Dirk
