rapidsai/dask-cuda

Specifying Worker Listen Port

otavioon opened this issue · 2 comments

Greetings!

I am encountering an issue when specifying the port for the worker to listen on. When using the traditional Dask Distributed with dask-worker (excluding GPU usage), I can utilize the --worker-port parameter to define this behavior. However, with dask-cuda-worker (version 23.10.0), I am unable to locate any option for this purpose, except for the --host parameter.
Consequently, when I execute the following command: CUDA_VISIBLE_DEVICES=0 dask-cuda-worker --scheduler-file scheduler.json --host 127.0.0.1:12345, it results in the following error:

warnings.warn(f'''
2023-09-29 13:39:00,329 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-bpnddwo9', purging
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-09-29 13:39:00,338 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1540, in close
    self.log_event(self.address, {"action": "closing-worker", "reason": reason})
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 723, in address
    raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2023-09-29 13:39:00,340 - distributed.worker - INFO - Stopping worker. Reason: failure-to-start-<class 'OSError'>
2023-09-29 13:39:00,340 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2023-09-29 13:39:00,341 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,386 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
    result = await self.process.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
    msg = await self._wait_until_connected(uid)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
    raise msg["exception"]
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:12345'. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,406 - distributed.nanny - INFO - Worker process 15064 was killed by signal 15
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 362, in start_unsafe
    response = await self.instantiate()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
    result = await self.process.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
    msg = await self._wait_until_connected(uid)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
    raise msg["exception"]
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/bin/dask-cuda-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 442, in worker
    loop.run_sync(run)
  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 434, in run
    await worker
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 244, in _wait
    await asyncio.gather(*self.nannies)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.

Without using the --host parameter, everything functions as expected, although I am unable to specify the desired port. Is there a method to achieve this?

IIRC, --host should only bind to the IP address, so specifying a port as well will indeed not work. I guess the --worker-port parameter was just never needed and thus never added, but there's no technical reason it's not there.

If --worker-port is important for your use case, would you care to submit a pull request adding it?
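For anyone reading along, here is a minimal sketch of the failure mode, independent of Dask. One reading of the log above (an assumption, not something confirmed in the thread) is that the Nanny had already bound 127.0.0.1:12345, so the worker's later attempt to listen on the same address hit "Address already in use". The helper name try_double_bind is hypothetical and uses only the standard library socket module:

```python
import errno
import socket


def try_double_bind(host="127.0.0.1"):
    """Bind a listening socket, then try to bind a second one to the same address.

    Returns the errno of the second bind's failure, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so the sketch does not
        # depend on a specific port such as 12345 being available.
        first.bind((host, 0))
        first.listen()
        port = first.getsockname()[1]

        # A second process (or component) asking for the same host:port
        # is rejected while the first socket is still listening.
        second.bind((host, port))
    except OSError as exc:
        return exc.errno
    finally:
        second.close()
        first.close()
    return None


if __name__ == "__main__":
    code = try_double_bind()
    print(code == errno.EADDRINUSE, errno.errorcode.get(code))
```

This only illustrates the operating-system behaviour behind the OSError in the traceback. It does not show how dask-cuda-worker parses --host internally.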

Hello,

Sorry for the delay, and thanks for your reply, @pentschev. I will submit a PR adding these options.