subprocess.CalledProcessError: Command xxx returned non-zero exit status 1.
gm0616 opened this issue · 3 comments
When I try to train on the VCR dataset with the command
./scripts/dist_run_single.sh 1 vcr/train_end2end.py ./cfgs/vcr/base_q2a_4x16G_fp32.yaml ./
I get this error:
Traceback (most recent call last):
File "vcr/train_end2end.py", line 59, in <module>
main()
File "vcr/train_end2end.py", line 53, in main
rank, model = train_net(args, config)
File "/gruntdata/guimin.gm/vlbert/vcr/../vcr/function/train.py", line 87, in train_net
group_name='mtorch')
File "/home/guimin.gm/miniconda3/envs/pt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/home/guimin.gm/miniconda3/envs/pt/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use
Traceback (most recent call last):
File "./scripts/launch.py", line 200, in <module>
main()
File "./scripts/launch.py", line 196, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/home/guimin.gm/miniconda3/envs/pt/bin/python', '-u', 'vcr/train_end2end.py', '--cfg', './cfgs/vcr/base_q2a_4x16G_fp32.yaml', '--model-dir', './', '--dist']' returned non-zero exit status 1.
I haven't found a solution. Can anybody help me? Thanks a lot.
The same problem happens when I try distributed training.
Why not try non-distributed training?
It seems the port is already in use by another program. Could you try changing the port in:
Line 134 in 4373674
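If it helps, here is a minimal sketch for checking whether a port is taken and for asking the OS for a free one before launching. The helper names is_port_in_use and find_free_port are made up for illustration and are not part of this repo or of PyTorch. It also assumes the rendezvous happens on the local machine (127.0.0.1).

```python
import socket


def is_port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on (host, port).

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when another process is listening on that port.
        return s.connect_ex((host, port)) == 0


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the OS pick any free port.
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    print("free port:", port)
```

The port is released as soon as find_free_port returns, so another process could in principle grab it before training starts; treat the result as a good guess rather than a reservation. How the port is passed on depends on the launcher. With torch.distributed.launch, for example, it goes in through --master_port.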
Can anyone help me solve this? I got this error when running training:
AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python3.6 -m launch_ddp --config configs/dist-training-config.yaml"
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/ml/code/launch_ddp.py", line 42, in
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=joint_cmd)
subprocess.CalledProcessError: Command 'python -m torch.distributed.launch --nnodes 1 --node_rank 0 --nproc_per_node 1 --master_addr algo-1 --master_port 55555 /opt/ml/code/train.py --config configs/dist-training-config.yaml' returned non-zero exit status 1., exit code: 1
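A note on reading traces like this one: subprocess.CalledProcessError only reports that the child command exited with a non-zero status. The actual cause is whatever the child process wrote to stderr before it exited (in the first post it was "RuntimeError: Address already in use"). Below is a rough sketch of a launcher wrapper that keeps the child's stderr so the real error is visible. The name run_and_report is hypothetical, and this is not the code in launch.py or launch_ddp.py.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, echo the child's stderr on failure, then raise.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
    )
    if proc.returncode != 0:
        # The child's own traceback is the useful part, so show it first.
        sys.stderr.write(proc.stderr)
        raise subprocess.CalledProcessError(
            proc.returncode, cmd, output=proc.stdout, stderr=proc.stderr
        )
    return proc.stdout


if __name__ == "__main__":
    # A child that fails the same way the training script did in the first post.
    child = [sys.executable, "-c", "raise RuntimeError('Address already in use')"]
    try:
        run_and_report(child)
    except subprocess.CalledProcessError as err:
        print("exit status:", err.returncode)
```

So for the error above, the lines printed by train.py before the CalledProcessError are the ones to look for in the log.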