Unhandled RayOnGolemClientError exception during ray down
lucekdudek opened this issue · 1 comments
lucekdudek commented
ray down -y golem-cluster.dev.yaml
2023-11-14 15:08:15,139 WARNING util.py:251 -- Dropping the empty legacy field head_node. head_nodeis not supported for ray>=2.0.0. It is recommended to removehead_node from the cluster config.
2023-11-14 15:08:15,139 WARNING util.py:251 -- Dropping the empty legacy field worker_nodes. worker_nodesis not supported for ray>=2.0.0. It is recommended to removeworker_nodes from the cluster config.
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
2023-11-14 15:08:15,145 WARNING util.py:251 -- Dropping the empty legacy field head_node. head_nodeis not supported for ray>=2.0.0. It is recommended to removehead_node from the cluster config.
2023-11-14 15:08:15,145 WARNING util.py:251 -- Dropping the empty legacy field worker_nodes. worker_nodesis not supported for ray>=2.0.0. It is recommended to removeworker_nodes from the cluster config.
Ray On Golem webserver
Not starting webserver, as it's already running
Fetched IP: 192.168.0.3
Stopped only 5 out of 6 Ray processes within the grace period 16 seconds. Set `-v` to see more details. Remaining processes [psutil.Process(pid=751, name='gcs_server', status='terminated', started='13:56:16')] will be forcefully terminated.
You can also use `--force` to forcefully terminate processes or set higher `--grace-period` to wait longer time for proper termination.
Shared connection to 192.168.0.3 closed.
2023-11-14 15:08:24,367 INFO node_provider.py:173 -- NodeProvider: node0: Terminating node
Traceback (most recent call last):
  File "/home/lucjan/Repos/golem-ray/.venv/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1337, in down
    teardown_cluster(
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/autoscaler/_private/commands.py", line 548, in teardown_cluster
    provider.terminate_nodes(A)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/autoscaler/node_provider.py", line 174, in terminate_nodes
    self.terminate_node(node_id)
  File "/home/lucjan/Repos/golem-ray/ray_on_golem/provider/node_provider.py", line 138, in terminate_node
    terminated_nodes = self._ray_on_golem_client.terminate_node(node_id)
  File "/home/lucjan/Repos/golem-ray/ray_on_golem/client/client.py", line 58, in terminate_node
    response = self._make_request(
  File "/home/lucjan/Repos/golem-ray/ray_on_golem/client/client.py", line 198, in _make_request
    raise RayOnGolemClientError(f"{error_message}: {response.text}")
ray_on_golem.client.exceptions.RayOnGolemClientError: Couldn't terminate node: 500 Internal Server Error
Server got itself in trouble
...
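For illustration only, here is a minimal sketch of how the error raised in the traceback above could be caught and reported instead of surfacing as a raw traceback. It is not the project's actual code. `RayOnGolemClientError` is redefined locally as a stand-in for `ray_on_golem.client.exceptions.RayOnGolemClientError`, and `FailingClient` and `terminate_node_safely` are hypothetical names made up for this example.

```python
import logging

logger = logging.getLogger("ray_on_golem_sketch")


class RayOnGolemClientError(Exception):
    """Local stand-in for ray_on_golem.client.exceptions.RayOnGolemClientError."""


class FailingClient:
    """Hypothetical client that fails the same way as in the traceback."""

    def terminate_node(self, node_id):
        raise RayOnGolemClientError(
            "Couldn't terminate node: 500 Internal Server Error"
        )


def terminate_node_safely(client, node_id):
    """Return True if the node was terminated, False if the client raised."""
    try:
        client.terminate_node(node_id)
    except RayOnGolemClientError as e:
        # Log a readable message instead of letting the exception
        # propagate up through `ray down` as an unhandled traceback.
        logger.error("Failed to terminate node %s: %s", node_id, e)
        return False
    return True
```

Whether to swallow the error, retry, or re-raise it with a clearer message is a design decision for the provider. The sketch only shows the "log and continue" variant.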
shadeofblue commented
@lucekdudek does it still manifest itself?