edf-hpc/verrou

Parallel verrou_dd crashes with ValueError

HadrienG2 opened this issue · 3 comments

When I run verrou_dd in parallel on a test workload of mine, it systematically crashes on the second iteration with this kind of backtrace. Sequential runs work fine on the same workload.

$ VERROU_DD_NUM_THREADS=4 VERROU_DD_NRUNS=4 verrou_dd `pwd`/run.sh `pwd`/cmp.sh
[...]
dd (run #1): trying 6275 + 6275
/root/acts-core/build/IntegrationTests/dd.sym/ca2681d399ee504572a37d53b1416f6f  --( run )-> 
Traceback (most recent call last):
  File "/usr/local/bin/verrou_dd", line 633, in <module>
    main(runScript, cmpScript, algoSearch=ddAlgo)
  File "/usr/local/bin/verrou_dd", line 605, in main
    (refSym, confSymsTab) = ddSym(run, compare)
  File "/usr/local/bin/verrou_dd", line 438, in ddSym
    conf = dd.ddmax(deltas)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 733, in ddmax
    return self.ddgen(c, 0, 1)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 607, in ddgen
    outcome = self._dd(c, n)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 670, in _dd
    (t, cs[i]) = self.test_mix(cs[i], c, self.REMOVE)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 580, in test_mix
    directionbar)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 384, in test_and_resolve
    t = self.test(csubr)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 313, in test
    outcome = self._test(c)
  File "/usr/local/bin/verrou_dd", line 409, in _test
    return vT.run()
  File "/usr/local/bin/verrou_dd", line 127, in run
    return self.runParMax(maxNbPROC)
  File "/usr/local/bin/verrou_dd", line 202, in runParMax
    run=self.pidRunTab.index(pid)                
ValueError: 50 is not in list

My test workload is a bit complicated, but I have it inside of a docker container if that can be useful. Or maybe we can find a simpler reproducer.

This is somewhat related to #8, in the sense that if the end decision is to change the verrou_dd parallelization algorithm, it may not be worth spending too much energy on fixing the existing one.

I think you get a high score with 12500 symbols.
It looks like a bug in the Python scheduler, which I want to rewrite in Python 3.
If you can keep this test workload for later, I'm interested.
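
For illustration only, here is a minimal sketch of a bounded parallel runner that cannot fail the way the traceback does. The traceback shows only that `runParMax` looks up a pid with `self.pidRunTab.index(pid)` and that the pid was not in the list; how that pid was obtained is not visible, so this is not a description of verrou_dd's actual code or of the eventual fix. The sketch assumes one possible design: keep the `Popen` handles in a dict and poll only those handles, so a pid that was never registered is never looked up. The names `run_parallel`, `commands` and `max_procs` are hypothetical.

```python
import subprocess
import sys
import time


def run_parallel(commands, max_procs):
    """Run each command with at most max_procs children alive at once.

    Returns the exit codes in the same order as `commands`.
    Hypothetical helper for illustration, not part of verrou_dd.
    """
    if max_procs < 1:
        raise ValueError("max_procs must be at least 1")

    pending = list(enumerate(commands))
    running = {}  # pid -> (index in commands, Popen handle)
    results = [None] * len(commands)

    while pending or running:
        # Start new children while there is spare capacity.
        while pending and len(running) < max_procs:
            index, cmd = pending.pop(0)
            proc = subprocess.Popen(cmd)
            running[proc.pid] = (index, proc)

        # Poll only the children started above, so a pid that was
        # never registered here can never be looked up.
        finished = [pid for pid, (_, proc) in running.items()
                    if proc.poll() is not None]
        for pid in finished:
            index, proc = running.pop(pid)
            results[index] = proc.returncode

        if not finished:
            time.sleep(0.01)  # avoid a busy loop while children run

    return results


if __name__ == "__main__":
    # Five trivial children that exit with codes 0..4, two at a time.
    cmds = [[sys.executable, "-c", "import sys; sys.exit(%d)" % i]
            for i in range(5)]
    print(run_parallel(cmds, max_procs=2))
```

Polling with a short sleep is simpler than blocking on a wait call, at the cost of a small latency per finished child.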

That's a C++ binary that uses boost + Eigen and is built in O0 mode. I'm not surprised that the symbol table got crazy :) Please ping me when you are done with the python3 port, and it will be my pleasure to torture it as well.

Since v2.3.1, the delta-debugging implementation should be robust enough to handle this problem. If not, you can open a new issue.