Specifying Worker Listen Port
otavioon opened this issue · 2 comments
Greetings!
I am encountering an issue when specifying the port for the worker to listen on. When using the traditional Dask Distributed with dask-worker (excluding GPU usage), I can utilize the --worker-port parameter to define this behavior. However, with dask-cuda-worker (version 23.10.0), I am unable to locate any option for this purpose, except for the --host parameter.
Consequently, when I execute the following command:
CUDA_VISIBLE_DEVICES=0 dask-cuda-worker --scheduler-file scheduler.json --host 127.0.0.1:12345
it results in the following error:
warnings.warn(f'''
2023-09-29 13:39:00,329 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-bpnddwo9', purging
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-09-29 13:39:00,338 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1540, in close
self.log_event(self.address, {"action": "closing-worker", "reason": reason})
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 723, in address
raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2023-09-29 13:39:00,340 - distributed.worker - INFO - Stopping worker. Reason: failure-to-start-<class 'OSError'>
2023-09-29 13:39:00,340 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2023-09-29 13:39:00,341 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
async with worker:
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
await self
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,386 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
result = await self.process.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
msg = await self._wait_until_connected(uid)
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
raise msg["exception"]
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
async with worker:
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
await self
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:12345'. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,406 - distributed.nanny - INFO - Worker process 15064 was killed by signal 15
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
await self.listen(start_address, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
listener = await listen(
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
sockets = netutil.bind_sockets(
File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 362, in start_unsafe
response = await self.instantiate()
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
result = await self.process.start()
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
msg = await self._wait_until_connected(uid)
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
raise msg["exception"]
File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
async with worker:
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
await self
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/bin/dask-cuda-worker", line 8, in <module>
sys.exit(worker())
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 442, in worker
loop.run_sync(run)
File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 434, in run
await worker
File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 244, in _wait
await asyncio.gather(*self.nannies)
File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
Without using the --host parameter, everything functions as expected, although I am unable to specify the desired port. Is there a method to achieve this?
IIRC, --host should only bind to the IP address, so specifying a port as well will indeed not work. I guess the --worker-port parameter was just never needed and thus never added, but there's no technical reason it's not there.
If --worker-port is important for your use case, would you care to submit a pull request with that?
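For anyone puzzled by the "Address already in use" error in the log, here is a minimal sketch of the underlying socket behavior using only the Python standard library. It assumes, based on the "Closing Nanny at 'tcp://127.0.0.1:12345'" line in the log, that the nanny had already claimed the host:port pair when the worker tried to listen on the same address; that reading is an inference, not something confirmed in this thread. The function name try_bind_twice is made up for illustration and is not part of Dask or dask-cuda.

```python
import errno
import socket


def try_bind_twice(host="127.0.0.1"):
    """Bind one listening socket, then try to bind a second to the same address.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so the sketch does not
        # depend on a specific port such as 12345 being available.
        first.bind((host, 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # A second process (or socket) asking for the exact same
            # host:port is refused while the first one holds it.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = try_bind_twice()
    print("second bind refused with EADDRINUSE:", code == errno.EADDRINUSE)
```

If that reading is right, a fixed port passed through --host can only ever be held by one of the two processes, which is why a separate option for the worker port would be needed.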
Hello,
Sorry for the delay, and thanks for your reply, @pentschev. I will submit a PR adding these options.