Successful jobs exit with error code
cwbeitel opened this issue · 0 comments
Currently, every run that reaches the end of main() terminates ungracefully by raising SystemExit. For example:
class TestRun(unittest.TestCase):
    def test_non_distributed_runs(self):
        os.environ['TF_CONFIG'] = '{"cluster":{"master":["pybullet-kuka-ff-c2f81017-master-v3k7-0:2222"]},"task":{"type":"master","index":0},"environment":"cloud"}'
        tmp_logdir = '/tmp/agents-logs/test/non-distributed-2'
        sys.argv.extend(["--steps=100",
                         "--sync_replicas=False",
                         "--num_agents=1",
                         "--logdir=%s" % tmp_logdir])
        tf.app.run()
This yields:
======================================================================
ERROR: test_non_distributed_runs (__main__.TestRun)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/app/trainer/task_test.py", line 34, in test_non_distributed_runs
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
SystemExit
----------------------------------------------------------------------
Ran 1 test in 21.829s
This is problematic for various reasons, including that jobs on Kubeflow get restarted because the job is treated as failed when it has actually just finished.
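For the unit test specifically, one possible workaround is to expect the SystemExit and check the code it carries. The sketch below is not the trainer's code: `run` is a hypothetical stand-in that mirrors the `_sys.exit(main(...))` line from the traceback so the example works without TensorFlow, and `main` is a placeholder assumed to return None on success. In Python, a SystemExit whose code is None or 0 corresponds to a successful exit status.

```python
import sys
import unittest


def run(main, argv=None):
    """Hypothetical stand-in for tf.app.run(): exit with main's return value."""
    argv = sys.argv[:1] if argv is None else argv
    sys.exit(main(argv))


def main(argv):
    """Placeholder for the trainer's main(); assumed to return None on success."""
    return None


class TestCleanExit(unittest.TestCase):
    def test_run_exits_cleanly(self):
        # run() always raises SystemExit, so catch it instead of letting
        # unittest report the test as an error.
        with self.assertRaises(SystemExit) as ctx:
            run(main, ["task_test.py"])
        # A code of None or 0 means the process would exit successfully.
        self.assertIn(ctx.exception.code, (None, 0))
```

Saved as a test module, this can be run with `python -m unittest`. It only addresses how the test reports the exit; whether the job's real exit status is non-zero would still need to be checked separately.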