seba-1511/dist_tuto.pth

How to launch a job in a cluster?

Closed this issue · 3 comments

Environment: Tesla K20 cluster, two nodes, each node has two GPUs.

Process Process-1:
Traceback (most recent call last):
File "/home/zhangzhaoyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/zhangzhaoyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "train_dist.py", line 134, in init_processes
dist.init_process_group(backend, rank=rank, world_size=size)
File "/home/zhangzhaoyu/anaconda2/lib/python2.7/site-packages/torch/distributed/init.py", line 49, in init_process_group
group_name, rank)
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:17

Can you provide the script that you are running, and how you are running it on both nodes?

I just run python train_dist.py. When I run it locally, it works fine. However, when I submit it to the cluster, it doesn't work at all. The only difference is os.environ['MASTER_ADDR'] = '172.16.1.185', which is my node's address.
This is the result of top -b -n 2 > top.log.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2753 zhangzha 20 0 86.4g 83m 4468 R 100.0 0.1 1170:32 python
4148 zhangzha 20 0 86.5g 84m 4836 R 100.0 0.1 1150:54 python
4149 zhangzha 20 0 86.4g 84m 4468 R 100.0 0.1 1150:52 python
6167 zhangzha 20 0 86.4g 83m 4464 R 100.0 0.1 1112:46 python
1533 zhangzha 20 0 86.5g 84m 4836 R 99.0 0.1 1194:02 python
1534 zhangzha 20 0 86.4g 83m 4468 R 99.0 0.1 1194:01 python
2752 zhangzha 20 0 86.5g 84m 4836 R 99.0 0.1 1170:33 python
6166 zhangzha 20 0 86.5g 84m 4832 R 99.0 0.1 1112:46 python
8879 zhangzha 20 0 86.8g 103m 10m R 99.0 0.2 1036:09 python
8880 zhangzha 20 0 86.8g 103m 9.8m R 99.0 0.2 1036:09 python
9083 zhangzha 20 0 86.8g 104m 10m R 99.0 0.2 1034:23 python
9084 zhangzha 20 0 86.8g 104m 9.8m R 99.0 0.2 1034:21 python
12969 zhangzha 20 0 86.8g 105m 10m R 99.0 0.2 922:36.59 python
12970 zhangzha 20 0 86.8g 104m 9.8m R 99.0 0.2 922:34.89 python

If the script is executed on both nodes and you didn't comment out what's below the if __name__ == "__main__": line, then you're training with 4 replicas but only declared 2.
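For reference, assuming train_dist.py still follows the tutorial's template, the bottom of the file looks roughly like this (a sketch with the tutorial's names; only the MASTER_ADDR value is the one you mentioned). Run unmodified on each node, it spawns size processes per node, so two nodes give you 4 replicas while world_size stays 2:

```python
import os
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    # placeholder for the actual training loop in train_dist.py
    pass

def init_processes(rank, size, fn, backend='gloo'):
    # every replica must agree on the same rendezvous address/port
    os.environ['MASTER_ADDR'] = '172.16.1.185'  # the address you set
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2  # world_size declared here
    processes = []
    for rank in range(size):  # spawns `size` processes on *this* node
        p = Process(target=init_processes, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```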

Fix it by spawning the right number of processes per node and initializing with the right world_size (e.g., you could substitute size with 4 on line 133), along the lines of the sketch below.
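A minimal sketch of that change, reusing Process, init_processes, and run from the snippet above. The node_id command-line argument is hypothetical (not part of the original script): it is 0 on the master node and 1 on the other, so global ranks don't collide across nodes and only one process ends up claiming rank 0 on MASTER_PORT.

```python
import sys

if __name__ == "__main__":
    node_id = int(sys.argv[1])   # hypothetical: 0 on the master node, 1 on the other
    procs_per_node = 2
    world_size = 4               # the "substitute size with 4" suggestion
    processes = []
    for local_rank in range(procs_per_node):
        global_rank = node_id * procs_per_node + local_rank
        p = Process(target=init_processes, args=(global_rank, world_size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

You would then launch it as python train_dist.py 0 on the master node and python train_dist.py 1 on the second node, so the two nodes together make up the 4 declared replicas.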