Workers keep running regardless of retire all workers and close the client
LUOXIAO92 opened this issue · 2 comments
This issue has occurred since I started using rapids-22.08, so I add the following codes at the end of my program.
client.retire_workers()
client.close()
time.sleep(5)
sys.exit()
It works (but about 10% probability fails) until rapids-23.06. But after installing rapids-23.08, this issue occurs more frequently than before nevertheless adding the above code, maybe about 90% of the jobs. So I have to send the delete job command manually to force the jobs exiting.
The log shows that even I retire all workers and close the client, and exit the program manually, the worker will still be resetted and wait to connect to client.
The system is Ubuntu 20.04LTS 5.4.0-155-generic, running on supercomputer cluster.
log
2023-09-24 20:05:36,481 - distributed.scheduler - INFO - Retiring worker ucx://10.110.0.74:57881
2023-09-24 20:05:36,481 - distributed.scheduler - INFO - Retiring worker ucx://10.110.0.73:48217
2023-09-24 20:05:36,481 - distributed.active_memory_manager - INFO - Retiring worker ucx://10.110.0.73:48217; no unique keys need to be moved away.
2023-09-24 20:05:36,481 - distributed.active_memory_manager - INFO - Retiring worker ucx://10.110.0.74:57881; no unique keys need to be moved away.
2023-09-24 20:05:37,522 - distributed.scheduler - INFO - Remove worker <WorkerState 'ucx://10.110.0.74:57881', name: 3, status: closing_gracefully, memory: 0, processing: 0> (stimulus_id='retire-workers-1695553536.4813309')
2023-09-24 20:05:37,522 - distributed.scheduler - INFO - Retired worker ucx://10.110.0.74:57881
2023-09-24 20:05:37,522 - distributed.scheduler - INFO - Remove worker <WorkerState 'ucx://10.110.0.73:48217', name: 2, status: closing_gracefully, memory: 0, processing: 0> (stimulus_id='retire-workers-1695553536.4813309')
2023-09-24 20:05:37,522 - distributed.scheduler - INFO - Lost all workers
2023-09-24 20:05:37,522 - distributed.scheduler - INFO - Retired worker ucx://10.110.0.73:48217
2023-09-24 20:05:37,522 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 8.59s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-09-24 20:05:37,998 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'ucx://10.110.0.73:48217'.
2023-09-24 20:05:37,999 - distributed.worker - INFO - Stopping worker at ucx://10.110.0.74:57881. Reason: scheduler-remove-worker
2023-09-24 20:05:37,999 - distributed.worker - ERROR - Scheduler was unaware of this worker 'ucx://10.110.0.73:48217'. Shutting down.
2023-09-24 20:05:38,000 - distributed.worker - INFO - Stopping worker at ucx://10.110.0.73:48217. Reason: worker-close
2023-09-24 20:05:38,001 - distributed.scheduler - INFO - Remove client Client-d892bf58-5abe-11ee-8d8b-3cecefd36b1c
2023-09-24 20:05:38,001 - distributed.core - INFO - Received 'close-stream' from ucx://10.110.0.73:46438; closing.
2023-09-24 20:05:38,005 - distributed.scheduler - INFO - Remove client Client-d892bf58-5abe-11ee-8d8b-3cecefd36b1c
2023-09-24 20:05:38,006 - distributed.scheduler - INFO - Close client connection: Client-d892bf58-5abe-11ee-8d8b-3cecefd36b1c
2023-09-24 20:05:38,041 - distributed.nanny - INFO - Closing Nanny gracefully at 'ucx://10.110.0.74:35033'. Reason: scheduler-remove-worker
2023-09-24 20:05:38,203 - distributed.core - INFO - Received 'close-stream' from ucx://10.110.0.73:46438; closing.
2023-09-24 20:05:38,203 - distributed.core - INFO - Connection to ucx://10.110.0.73:46438 has been closed.
2023-09-24 20:05:38,244 - distributed.core - INFO - Received 'close-stream' from ucx://10.110.0.73:46438; closing.
2023-09-24 20:05:38,244 - distributed.core - INFO - Connection to ucx://10.110.0.73:46438 has been closed.
2023-09-24 20:05:38,249 - distributed.nanny - INFO - Worker closed
Traceback (most recent call last):
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/distributed/utils.py", line 803, in wrapper
return await func(*args, **kwargs)
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/distributed/worker.py", line 1508, in close
await self.finished()
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/distributed/core.py", line 592, in finished
await self._event_finished.wait()
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError
2023-09-24 20:05:38,926 - distributed.nanny - INFO - Worker closed
2023-09-24 20:05:40,250 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-09-24 20:05:41,988 - distributed.nanny - INFO - Closing Nanny at 'ucx://10.110.0.74:35033'. Reason: nanny-close-gracefully
2023-09-24 20:05:42,795 - distributed.nanny - WARNING - Restarting worker
2023-09-24 20:05:43,128 - distributed.scheduler - INFO - Receive client connection: Client-4dcaf83e-5aca-11ee-8d8b-3cecefd36b1c
2023-09-24 20:05:43,128 - distributed.core - INFO - Starting established connection to ucx://10.110.0.73:46438
2023-09-24 20:05:43,130 - distributed.worker - INFO - Run out-of-band function 'stop'
2023-09-24 20:05:43,130 - distributed.scheduler - INFO - Scheduler closing...
2023-09-24 20:05:43,131 - distributed.scheduler - INFO - Scheduler closing all comms
2023-09-24 20:05:43,132 - distributed.core - INFO - Connection to ucx://10.110.0.73:46438 has been closed.
Task exception was never retrieved
future: <Task finished name='Task-2415854' coro=<_listener_handler_coroutine() done, defined at /home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/ucp/core.py:128> exception=UCXError('<stream_recv>: Connection reset by remote peer')>
Traceback (most recent call last):
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/ucp/core.py", line 143, in _listener_handler_coroutine
peer_info = await exchange_peer_info(
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/ucp/core.py", line 52, in exchange_peer_info
await comm.stream_recv(endpoint, peer_info_arr, peer_info_arr.nbytes)
ucp._libs.exceptions.UCXError: <stream_recv>: Connection reset by remote peer
2023-09-24 20:05:47,531 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-09-24 20:05:47,531 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-09-24 20:05:48,761 - distributed.preloading - INFO - Run preload setup: dask_cuda.initialize
2023-09-24 20:05:48,762 - distributed.worker - INFO - Start worker at: ucx://10.110.0.73:34172
2023-09-24 20:05:48,762 - distributed.worker - INFO - Listening to: ucx://10.110.0.73:34172
2023-09-24 20:05:48,762 - distributed.worker - INFO - Worker name: 2
2023-09-24 20:05:48,762 - distributed.worker - INFO - dashboard at: 10.110.0.73:41059
2023-09-24 20:05:48,762 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:05:48,762 - distributed.worker - INFO - -------------------------------------------------
2023-09-24 20:05:48,762 - distributed.worker - INFO - Threads: 1
2023-09-24 20:05:48,762 - distributed.worker - INFO - Memory: 110.00 GiB
2023-09-24 20:05:48,762 - distributed.worker - INFO - Local Directory: /pmem/dask-scratch-space/worker-0e4qh4zy
2023-09-24 20:05:48,762 - distributed.worker - INFO - Starting Worker plugin RMMSetup-ff3c6fc1-3eda-4279-8fbb-01f0f0d66bbd
2023-09-24 20:05:48,765 - distributed.worker - INFO - Starting Worker plugin PreImport-678cb9f9-0a39-4059-aa1d-45282f784b54
2023-09-24 20:05:48,765 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-0758b721-914d-4d62-b13c-003a98002d3e
2023-09-24 20:05:48,765 - distributed.worker - INFO - -------------------------------------------------
2023-09-24 20:06:18,766 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:06:48,869 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:07:18,973 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:07:49,076 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:08:19,177 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:08:49,279 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:09:19,382 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:09:49,485 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:10:19,588 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:10:49,690 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:11:19,793 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:11:49,894 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:12:19,998 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:12:50,101 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:13:20,205 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:13:50,306 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:14:20,406 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:14:50,510 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:15:20,611 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:15:50,714 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:16:20,815 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:16:50,916 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:17:21,020 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:17:51,121 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:18:21,222 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:18:51,323 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:19:21,424 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:19:51,524 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:20:21,624 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:20:51,727 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:21:21,829 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:21:51,931 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:22:22,032 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:22:52,133 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:23:22,235 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:23:52,336 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:24:22,438 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:24:52,539 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:25:22,640 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:25:52,741 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:26:22,842 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
2023-09-24 20:26:52,944 - distributed.worker - INFO - Waiting to connect to: ucx://10.110.0.73:46438
The following log is the case of exiting the job successfully, but still with a error UCXError: <stream_recv>: Connection reset by remote peer
log
2023-09-24 20:19:25,516 - distributed.scheduler - INFO - Retiring worker ucx://10.110.0.78:58745
2023-09-24 20:19:25,516 - distributed.scheduler - INFO - Retiring worker ucx://10.110.0.77:42783
2023-09-24 20:19:25,516 - distributed.active_memory_manager - INFO - Retiring worker ucx://10.110.0.78:58745; no unique keys need to be moved away.
2023-09-24 20:19:25,516 - distributed.active_memory_manager - INFO - Retiring worker ucx://10.110.0.77:42783; no unique keys need to be moved away.
2023-09-24 20:19:25,525 - distributed.scheduler - INFO - Remove worker <WorkerState 'ucx://10.110.0.78:58745', name: 3, status: closing_gracefully, memory: 0, processing: 0> (stimulus_id='retire-workers-1695554365.5159054')
2023-09-24 20:19:25,525 - distributed.scheduler - INFO - Retired worker ucx://10.110.0.78:58745
2023-09-24 20:19:25,525 - distributed.scheduler - INFO - Remove worker <WorkerState 'ucx://10.110.0.77:42783', name: 2, status: closing_gracefully, memory: 0, processing: 0> (stimulus_id='retire-workers-1695554365.5159054')
2023-09-24 20:19:25,525 - distributed.scheduler - INFO - Lost all workers
2023-09-24 20:19:25,525 - distributed.scheduler - INFO - Retired worker ucx://10.110.0.77:42783
2023-09-24 20:19:25,525 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 10.31s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-09-24 20:19:25,726 - distributed.scheduler - INFO - Remove client Client-8a452bba-5abf-11ee-9e9a-3cecefd36ad4
2023-09-24 20:19:25,726 - distributed.core - INFO - Received 'close-stream' from ucx://10.110.0.77:44963; closing.
2023-09-24 20:19:25,726 - distributed.worker - INFO - Stopping worker at ucx://10.110.0.78:58745. Reason: scheduler-remove-worker
2023-09-24 20:19:25,726 - distributed.worker - INFO - Stopping worker at ucx://10.110.0.77:42783. Reason: scheduler-remove-worker
2023-09-24 20:19:25,730 - distributed.scheduler - INFO - Remove client Client-8a452bba-5abf-11ee-9e9a-3cecefd36ad4
2023-09-24 20:19:25,730 - distributed.scheduler - INFO - Close client connection: Client-8a452bba-5abf-11ee-9e9a-3cecefd36ad4
2023-09-24 20:19:25,764 - distributed.nanny - INFO - Closing Nanny gracefully at 'ucx://10.110.0.77:42948'. Reason: scheduler-remove-worker
2023-09-24 20:19:25,779 - distributed.nanny - INFO - Closing Nanny gracefully at 'ucx://10.110.0.78:33665'. Reason: scheduler-remove-worker
2023-09-24 20:19:25,966 - distributed.core - INFO - Received 'close-stream' from ucx://10.110.0.77:44963; closing.
2023-09-24 20:19:25,966 - distributed.core - INFO - Connection to ucx://10.110.0.77:44963 has been closed.
2023-09-24 20:19:25,973 - distributed.nanny - INFO - Worker closed
2023-09-24 20:19:25,982 - distributed.core - INFO - Received 'close-stream' from ucx://10.110.0.77:44963; closing.
2023-09-24 20:19:25,982 - distributed.core - INFO - Connection to ucx://10.110.0.77:44963 has been closed.
2023-09-24 20:19:25,987 - distributed.nanny - INFO - Worker closed
2023-09-24 20:19:27,974 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-09-24 20:19:27,988 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-09-24 20:19:29,506 - distributed.nanny - INFO - Closing Nanny at 'ucx://10.110.0.77:42948'. Reason: nanny-close-gracefully
2023-09-24 20:19:29,875 - distributed.nanny - INFO - Closing Nanny at 'ucx://10.110.0.78:33665'. Reason: nanny-close-gracefully
2023-09-24 20:19:30,823 - distributed.scheduler - INFO - Receive client connection: Client-3b279277-5acc-11ee-9e9a-3cecefd36ad4
2023-09-24 20:19:30,824 - distributed.core - INFO - Starting established connection to ucx://10.110.0.77:44963
2023-09-24 20:19:30,826 - distributed.worker - INFO - Run out-of-band function 'stop'
2023-09-24 20:19:30,826 - distributed.scheduler - INFO - Scheduler closing...
2023-09-24 20:19:30,827 - distributed.scheduler - INFO - Scheduler closing all comms
2023-09-24 20:19:30,832 - distributed.core - INFO - Connection to ucx://10.110.0.77:44963 has been closed.
Task exception was never retrieved
future: <Task finished name='Task-2621918' coro=<_listener_handler_coroutine() done, defined at /home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/ucp/core.py:128> exception=UCXError('<stream_recv>: Connection reset by remote peer')>
Traceback (most recent call last):
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/ucp/core.py", line 143, in _listener_handler_coroutine
peer_info = await exchange_peer_info(
File "/home/DENSEQCD/luoxiao/.conda/envs/rapids-23.08/lib/python3.10/site-packages/ucp/core.py", line 52, in exchange_peer_info
await comm.stream_recv(endpoint, peer_info_arr, peer_info_arr.nbytes)
ucp._libs.exceptions.UCXError: <stream_recv>: Connection reset by remote peer
Could you include some more information to help us understand the problem a little better? See below.
- Could you post the output of
conda list
(if installing the environment via conda or using RAPIDS docker images) orpip list
(if installing the environment via pip). - How many workers does your cluster have when this occurs?
- Does it only happen after you run some compute in the cluster or can you reproduce that as well simply by launching the cluster and immediately running
client.retire_workers()
+client.close()
? - If you use
client.shutdown()
instead of theclient.retire_workers()
+client.close()
, does that occur as well? - Would you be able to test with TCP instead of UCX and check if the same happens?
- Would it be possible to try and reproduce the same with
LocalCluster
/dask worker
instead ofLocalCUDACluster
/dask cuda worker
? Note that you'd probably want to mimick the setup of Dask-CUDA by specifyingnthreads=1
/nworkers=NUM_GPUS
.
The above is in the order I'd try things out, so if, for example, with 5 the problem doesn't happen we could skip 6. I also understand particularly 5 and 6 may be hard for you to try out depending on how the problem manifests, but unfortunately there are so many moving pieces that it is often difficult to go to the source of the problem without us narrowing down the issue first.
Also note that we've been battling issues with shutting down processes for quite a while, and this is a very time-consuming, entangled task, this is why it's been hard to resolve issues like this one. Just for reference, an example of such issue is dask/distributed#7726 .