cwbeitel/kubeflow-rl

Successful jobs exit with error code

cwbeitel opened this issue · 0 comments

Currently, every run that reaches the end of main() terminates with an ungraceful system exit. For example, this test:

class TestRun(unittest.TestCase):

    def test_non_distributed_runs(self):
      os.environ['TF_CONFIG'] = '{"cluster":{"master":["pybullet-kuka-ff-c2f81017-master-v3k7-0:2222"]},"task":{"type":"master","index":0},"environment":"cloud"}'
      tmp_logdir = '/tmp/agents-logs/test/non-distributed-2'
      sys.argv.extend(["--steps=100",
                       "--sync_replicas=False",
                       "--num_agents=1",
                       "--logdir=%s" % tmp_logdir])
      tf.app.run()

Yields

======================================================================
ERROR: test_non_distributed_runs (__main__.TestRun)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/app/trainer/task_test.py", line 34, in test_non_distributed_runs
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
SystemExit

----------------------------------------------------------------------
Ran 1 test in 21.829s

This is problematic for several reasons, including that jobs on Kubeflow restart because the job is treated as failed when it has actually just finished.
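
A minimal sketch of one way to make the test tolerate the exit, assuming the only problem in the test is the uncaught SystemExit. The traceback shows tf.app.run() ending in `_sys.exit(main(...))`, so SystemExit is raised even when main() returns normally. In Python, `sys.exit(None)` corresponds to exit status 0, so the test can catch the exception and check its code. The names `fake_app_run`, `main`, and `run_and_get_exit_code` are hypothetical stand-ins so the example runs without TensorFlow; they are not part of the trainer or of TensorFlow.

```python
import sys
import unittest


def main(argv):
    # Hypothetical stand-in for the trainer's main(); returns None on success.
    return None


def fake_app_run(main_fn, argv):
    # Hypothetical stand-in mirroring the line shown in the traceback:
    #   _sys.exit(main(_sys.argv[:1] + flags_passthrough))
    sys.exit(main_fn(argv))


def run_and_get_exit_code(main_fn, argv):
    # Run the app and return the code carried by SystemExit.
    # A code of None or 0 means a successful exit.
    try:
        fake_app_run(main_fn, argv)
    except SystemExit as e:
        return e.code
    return None  # Only reached if the app did not call sys.exit().


class TestRun(unittest.TestCase):

    def test_run_exits_cleanly(self):
        # Expect SystemExit, then assert that it signals success.
        with self.assertRaises(SystemExit) as ctx:
            fake_app_run(main, ["task.py", "--steps=100"])
        self.assertIn(ctx.exception.code, (None, 0))
```

This can be run with `python -m unittest <module>`. It only addresses the ERROR reported by the test runner. Whether the same exit path explains the restarts on Kubeflow is not established here and would need to be checked against the exit status the container actually reports.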