microsoft/DeepSpeed

[BUG] Cannot use --hostfile to start multi-node training in Docker.

Ind1x1 opened this issue · 2 comments

Describe the bug
I connected containers on two hosts through an overlay network and configured passwordless SSH, /etc/hosts, and the hostfile. Even so, training fails to start with the command `deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py`. After checking deepspeed.ai, I found that I can start training with the "Launching without passwordless SSH" method, using the command
`deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=10.0.1.13 test.py`
I would like to know what causes the first command to fail.

Below are the log from the failed launch and the relevant configuration.
```
root@903c1e9c351c:/home/user/code# deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py
[2024-12-16 07:28:11,223] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 470, in main
    subprocess.check_call(safe_ssh_cmd, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-o', 'PasswordAuthentication=no', 'manager', 'hostname']' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/deepspeed", line 6, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 472, in main
    raise RuntimeError(
RuntimeError: Using hostfile at hostfile but host=manager was not reachable via ssh. If you are running with a single node please remove hostfile or setup passwordless ssh.
```
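
For anyone reproducing this: the traceback shows that the launcher runs `ssh -o PasswordAuthentication=no manager hostname` with stdout and stderr sent to `DEVNULL`, so the actual SSH error is never printed. Exit status 255 is what `ssh` itself returns when it hits an error, such as a refused connection or failed authentication, as opposed to the exit code of the remote command. A minimal sketch that reruns the same probe while capturing stderr might look like the following. The function names `build_probe_cmd` and `probe_host` are hypothetical helpers written for this illustration, not part of DeepSpeed.

```python
import subprocess


def build_probe_cmd(host):
    """Return the same ssh probe command that appears in the traceback."""
    return ["ssh", "-o", "PasswordAuthentication=no", host, "hostname"]


def probe_host(host, timeout=15):
    """Run the probe and return (exit_code, stderr_text).

    Unlike the launcher call in the traceback, stderr is captured here so
    the underlying ssh error message can be inspected.
    The fallback codes 127 and 124 are conventions chosen for this sketch.
    """
    try:
        result = subprocess.run(
            build_probe_cmd(host),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        # No ssh client binary is installed in this container.
        return 127, "ssh client not found in this container"
    except subprocess.TimeoutExpired:
        return 124, f"ssh to {host} timed out after {timeout}s"
    return result.returncode, result.stderr.strip()
```

Calling `probe_host("manager")` and `probe_host("worker")` from inside the container that runs `deepspeed` should return the message that the launcher hides. Note that the failing host in the log is `manager`, which is the local container according to /etc/hosts, so the probe has to succeed against the local machine as well as the remote one.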

root@903c1e9c351c:/home/user/code# cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.0.1.13 903c1e9c351c
10.0.1.13 manager
10.0.1.15 worker
root@903c1e9c351c:/home/user/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
    inet 10.0.1.13 netmask 255.255.255.0 broadcast 10.0.1.255
    ether 02:42:0a:00:01:0d txqueuelen 0 (Ethernet)
    RX packets 512 bytes 78880 (78.8 KB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 471 bytes 79480 (79.4 KB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

hostfile:
manager slots=1
worker slots=1
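
As a sanity check on the configuration above, every hostname in the hostfile should have an entry in /etc/hosts. The sketch below cross-checks the two files. It assumes the simple `name slots=N` line format shown in this issue and is a simplified parser written for illustration, not DeepSpeed's own hostfile parser. The function names are hypothetical.

```python
def parse_hostfile(text):
    """Parse 'name slots=N' lines into a {name: slots} dict."""
    hosts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed hostfile line: {raw!r}")
        name, slot_spec = parts
        key, _, value = slot_spec.strip().partition("=")
        if key != "slots" or not value.isdigit():
            raise ValueError(f"malformed hostfile line: {raw!r}")
        hosts[name] = int(value)
    return hosts


def parse_etc_hosts(text):
    """Parse /etc/hosts content into a {hostname: ip} dict (first entry wins)."""
    mapping = {}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        ip, *names = fields
        for name in names:
            mapping.setdefault(name, ip)
    return mapping


def unresolved_hosts(hostfile_text, etc_hosts_text):
    """Return hostfile names that have no /etc/hosts entry."""
    known = parse_etc_hosts(etc_hosts_text)
    return [name for name in parse_hostfile(hostfile_text) if name not in known]
```

With the files shown in this issue, both `manager` and `worker` have entries in /etc/hosts, so the check returns an empty list. That points away from name resolution and toward the SSH setup itself as the cause of the exit status 255.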