Debugging BLCR problem
Tue 10 April 2012 by Dr. Dirk Colbry
We have isolated the problem. Details of the solution can be found at the following blog post: https://wiki.hpcc.msu.edu/x/qKXT
We had BLCR working great on our SLES10 HPC system about four months ago (I am not sure which version of BLCR we were running). We have since upgraded the system to RHEL6.0, and unfortunately BLCR (0.8.3) no longer works reliably: the cr_restart command succeeds about 80-90% of the time and segfaults the other 10-20%. I have been trying to come up with a reliable test case to submit as a bug report, but the intermittent nature of the problem makes it really hard to isolate.
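One way to put a number on an intermittent failure like this is to run the same command many times and count the non-zero exits. The snippet below is only a minimal sketch of that idea: `count_failures` is a hypothetical helper (not part of BLCR), and it assumes the command under test returns promptly and can safely be repeated.

```shell
# Hypothetical helper: run a command N times and print how many runs failed.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Any non-zero exit status counts as a failure; a process killed by
        # SIGSEGV is typically reported by the shell as status 139.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 20 cr_restart context.12345` would report how many of 20 restart attempts failed, where `context.12345` stands in for whatever checkpoint file is being tested.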
I asked the BLCR mailing list, and Paul H. Hargrove was quick to reply with some debugging suggestions. One possible cause he suggested is that having different /usr/lib/locale/locale-archive files on different nodes could lead to problems. I checked with md5sum and found two different versions of the file installed on the system. I wrote the following submission script, designed to force a job to start only on a node with the same file. If the md5sum is different, the job just resubmits itself:
[Script listing (38 lines) not preserved.]
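Since the original listing is not available here, the following is a minimal sketch of the approach described above, not the author's script. It assumes a PBS/Torque-style `qsub` for resubmission; the function names, the job script name `blcr_test.qsub`, and the `EXPECTED_MD5` variable are all hypothetical.

```shell
# Print only the hash field of md5sum's "hash  filename" output.
archive_md5() {
    md5sum "$1" | cut -d ' ' -f 1
}

# Succeed if the checksum of file $1 equals the expected hash $2.
node_matches() {
    [ "$(archive_md5 "$1")" = "$2" ]
}

# Return 0 if this node has the expected file; otherwise run the given
# resubmit command and return 1 so the caller can exit early.
# Usage: check_node_or_resubmit <file> <expected-md5> <resubmit command...>
check_node_or_resubmit() {
    file=$1
    expected=$2
    shift 2
    if node_matches "$file" "$expected"; then
        return 0
    fi
    echo "checksum mismatch on $(uname -n); resubmitting" >&2
    "$@"
    return 1
}

# Hypothetical use near the top of a job script (assumes qsub is available
# and EXPECTED_MD5 holds the hash taken from a known-good node):
#   check_node_or_resubmit /usr/lib/locale/locale-archive "$EXPECTED_MD5" \
#       qsub blcr_test.qsub || exit 0
#   ... checkpoint/restart test goes here ...
```

Passing the resubmit command as arguments keeps the check itself independent of the scheduler, so the same functions can be tried interactively on a login node before being used in a job.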
Unfortunately, all this test showed me was that the difference in the /usr/lib/locale/locale-archive files is not the cause of the problem I am debugging (although it could still cause a different problem).
Blog post migrated from the ICER Wiki using a custom Python script.