On Demand MakeFlow PBS script
Fri 11 November 2011 by Dr. Dirk Colbry
We just installed MakeFlow on our system as an easy-to-use workflow manager that uses the familiar "makefile" syntax. MakeFlow runs a master process on one node and farms all of the work out to worker nodes. To get this working on our system, I wanted an easy way for the master node to communicate its location and port to the worker nodes. With the help of one of our students, we came up with three basic ways for this to work:
- Option 1: Schedule a master job and have it schedule worker jobs.
- Option 2: Schedule a large job and use pbsdsh to call the worker nodes.
- Option 3: Combine Options 1 & 2
Which option you use depends a lot on how PBS is set up on your system and there are different pros and cons to each setup. The following is a description of how I set it up on our HPCC:
Option 1: Schedule a master job and have it schedule worker jobs.
This option requires that the system allow jobs to be submitted from the compute nodes. If it does, it is easy to pass the master's host information to the workers through an environment variable. The workers can be submitted as a job array with as many single-node jobs as needed, and they will be scheduled as resources become available. The downside to this approach is that no work gets done until at least one worker has been scheduled. Here is an example script:
[Example script not preserved; the original listing ran 25 lines.]
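As a rough illustration of Option 1 (not the original script), the sketch below has the master job submit its own workers as a job array and hand them its hostname and port through `qsub -v`. The port number, worker count, walltime, helper function names, and the workflow file name `Makeflow` are all assumptions. `qsub -t` is the Torque job-array flag; other PBS variants spell it differently. It also assumes the master's short hostname resolves from the other compute nodes.

```shell
#!/bin/bash
#PBS -N makeflow_master
#PBS -l nodes=1:ppn=1,walltime=04:00:00

PORT=${PORT:-9123}              # assumed Work Queue port
NUM_WORKERS=${NUM_WORKERS:-8}   # assumed number of single-node worker jobs

# Print the one-line script each worker job runs. MASTER_HOST and MASTER_PORT
# are expanded on the worker node, where qsub -v has placed them.
worker_script() {
    echo 'work_queue_worker -t 300 "$MASTER_HOST" "$MASTER_PORT"'
}

# Print the qsub command that submits the worker job array.
worker_qsub_cmd() {
    local host=$1 port=$2 count=$3
    echo "qsub -t 1-${count} -l nodes=1:ppn=1,walltime=04:00:00" \
         "-v MASTER_HOST=${host},MASTER_PORT=${port}"
}

# Do the real work only inside a PBS job, so the file is harmless elsewhere.
if [ -n "${PBS_JOBID:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    # qsub reads the job script from stdin when no script file is given.
    worker_script | $(worker_qsub_cmd "$(hostname)" "$PORT" "$NUM_WORKERS")
    # Run the workflow; workers connect as the scheduler starts them.
    makeflow -T wq -p "$PORT" Makeflow
fi
```

The `-t 300` on the worker is an idle timeout, so workers that start after the workflow has finished exit on their own after five minutes.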
Option 2: Schedule a large job and use pbsdsh to call the worker nodes.
This option can be used if the system is not set up to schedule jobs from the compute nodes. It has the added benefit that the workers start as soon as the job does, with no second wait in the queue. The downside is that the scheduler has to find a large block of nodes at once, which may mean a longer queue time. Here is the example script:
[Example script not preserved; the original listing ran 43 lines.]
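A minimal sketch of Option 2, again not the original script: one multi-node job runs the master and uses `pbsdsh` to start a worker on every allocated slot. The node count, port, timeout, helper name `worker_cmd`, and workflow file name `Makeflow` are assumptions. Because `pbsdsh` typically launches programs with a minimal environment, the sketch resolves the full path to `work_queue_worker` first.

```shell
#!/bin/bash
#PBS -N makeflow_pbsdsh
#PBS -l nodes=8:ppn=1,walltime=04:00:00

PORT=${PORT:-9123}   # assumed Work Queue port

# Print the worker command line that pbsdsh launches on each allocated slot.
worker_cmd() {
    local bin=$1 host=$2 port=$3
    echo "${bin} -t 60 ${host} ${port}"
}

# Do the real work only inside a PBS job, so the file is harmless elsewhere.
if [ -n "${PBS_JOBID:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    worker_bin=$(command -v work_queue_worker) || exit 1
    # Launch one worker per allocated slot in the background.
    pbsdsh $(worker_cmd "$worker_bin" "$(hostname)" "$PORT") &
    # Run the master in the foreground; idle workers give up after 60 seconds.
    makeflow -T wq -p "$PORT" Makeflow
    # Wait for pbsdsh (and therefore the workers) to finish before exiting.
    wait
fi
```

With `nodes=8:ppn=1` the node running the master also gets a worker, which seems reasonable since the master itself does little computation.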
Option 3: Combine Options 1 & 2
The third option is to combine Options 1 & 2. This way the job can get started right away on the nodes it already has and grow as additional worker jobs are scheduled.
[Example script not preserved; the original listing ran 53 lines.]
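One way the combination might look, as a sketch under the same assumptions as before (Torque `qsub -t`, an arbitrary port, a workflow file named `Makeflow`, hypothetical helper names): the job starts workers inside its own allocation with `pbsdsh`, then submits an array of extra single-node workers that join whenever they are scheduled. Here the master's host and port are written directly into the worker command instead of being passed with `qsub -v`.

```shell
#!/bin/bash
#PBS -N makeflow_hybrid
#PBS -l nodes=4:ppn=1,walltime=04:00:00

PORT=${PORT:-9123}                    # assumed Work Queue port
EXTRA_WORKERS=${EXTRA_WORKERS:-16}    # assumed number of extra worker jobs

# Print the worker command line for a given binary, master host, and port.
worker_cmd() {
    local bin=$1 host=$2 port=$3
    echo "${bin} -t 300 ${host} ${port}"
}

# Print the qsub command for the extra worker array (Torque -t syntax).
extra_qsub_cmd() {
    local count=$1
    echo "qsub -t 1-${count} -l nodes=1:ppn=1,walltime=04:00:00"
}

# Do the real work only inside a PBS job, so the file is harmless elsewhere.
if [ -n "${PBS_JOBID:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    worker_bin=$(command -v work_queue_worker) || exit 1
    cmd=$(worker_cmd "$worker_bin" "$(hostname)" "$PORT")
    # Workers inside this allocation start right away (Option 2).
    pbsdsh $cmd &
    # Extra workers join whenever the scheduler starts them (Option 1).
    echo "$cmd" | $(extra_qsub_cmd "$EXTRA_WORKERS")
    makeflow -T wq -p "$PORT" Makeflow
    wait
fi
```

Extra worker jobs that are still queued when the workflow finishes will start later, find no master, and exit after the idle timeout, so it may be worth deleting them with `qdel` once the master is done.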
Future Work
I need to be careful with this script because I have not tested what happens when two master jobs get scheduled on the same node and use the same port. I will need to add a check to make sure this doesn't happen. I am also planning to wrap everything into a simple command that hides all of the details from the user.
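The port check could be sketched roughly as follows. This is an untested idea, not something running on our system: the function names and the port range are hypothetical, and the probe relies on bash's `/dev/tcp` pseudo-device, so it needs bash and not plain sh.

```shell
#!/bin/bash

# Return success if something on this node already accepts connections on
# TCP port $1 via the loopback address.
port_in_use() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Print the first port in [first, last] that is not in use; fail if none is.
find_free_port() {
    local port=$1 last=$2
    while [ "$port" -le "$last" ]; do
        if ! port_in_use "$port"; then
            echo "$port"
            return 0
        fi
        port=$((port + 1))
    done
    return 1
}
```

A master script could then call `PORT=$(find_free_port 9123 9223) || exit 1` before starting MakeFlow. There is still a small window between the check and the moment the master binds the port, so two jobs starting at the same instant could in principle pick the same one.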
Blog post migrated from the ICER Wiki using a custom Python script.