LARC-CMU-SMU/FoodSeg103-Benchmark-v1

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=${PORT:-300} tools/train.py --config configs/foodnet/SETR_Naive_768x768_80k_base_RM.py --work-dir checkpoints/SETR_Naive_ReLeM --launcher pytorch


I have another question about training. When I follow your guideline, I get the following error:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
File "tools/train.py", line 167, in <module>
main()
File "tools/train.py", line 98, in main
init_dist(args.launcher, **cfg.dist_params)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 20, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 34, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Permission denied
Traceback (most recent call last):
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dongxiaoxiao/anaconda3/envs/open-mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=3', '--config', 'configs/foodnet/fpn_r50_512x1024_80k_RM.py', '--work-dir', 'checkpoints/FPN_r50_RM', '--launcher', 'pytorch']' returned non-zero exit status 1.

I also remember that about 20 days ago someone said "--launcher pytorch" may cause problems, but I am not sure whether that applies here. I hope you can reply. Thanks a lot!

@Mark1Dong Sorry for the late reply; I have been very busy recently. Could you first paste your environment (OS, GPU, etc.) so I can check it in more detail?
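One way to gather the environment details requested above is a short script like the one below. This is only a sketch: the helper name collect_env is hypothetical, and it assumes PyTorch may or may not be importable in the active environment, so the GPU fields are filled in only when it is.

```python
import platform
import sys


def collect_env() -> dict:
    """Collect basic OS, Python, and (if available) PyTorch/GPU details."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch  # optional: only present in the training environment
    except ImportError:
        info["torch"] = "not installed"
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The printed key/value lines can be pasted directly into the issue.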

Thanks for your reply. I have solved this problem. The main issue was the command-line format.

Also, the default port in ${PORT:-300} is not suitable, so I changed it to ${PORT:-36900}, and the problem is solved.
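A likely explanation for why the port change helped: on Linux, ports below 1024 are privileged by default, and binding to them as a normal user fails with a permission error, which is consistent with the "RuntimeError: Permission denied" raised when TCPStore tried to open the master port. The sketch below checks a candidate port before launching. The helper names (is_privileged, can_bind, find_free_port) are hypothetical and not part of PyTorch or mmcv, and the 1024 boundary is the usual Linux default rather than something confirmed in this thread.

```python
import socket

# Assumption: default Linux behaviour, where ports below 1024 need root to bind.
PRIVILEGED_PORT_LIMIT = 1024


def is_privileged(port: int) -> bool:
    """Return True if the port falls in the range that usually requires root."""
    return 0 < port < PRIVILEGED_PORT_LIMIT


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind a TCP socket to (host, port) and report whether it worked."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:  # includes PermissionError and "address already in use"
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    for port in (300, 36900):
        print(port, "privileged:", is_privileged(port), "bindable:", can_bind(port))
    print("free port suggestion:", find_free_port())
```

A port reported as bindable here can then be supplied to the launch command, for example by setting the PORT variable that ${PORT:-...} reads before falling back to its default.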