taoyang1122/adapt-image-models

An error with training on multiple GPUs


Hello author,
Thanks for this research. I want to train vitlarge_clip_k400 on 4 V100 GPUs, but when I run the command
"bash tools/dist_train.sh configs/recognition/vit/vitclip_large_k400.py 3 --test-last --validate --cfg-options model.backbone.pretrained=openaiclip work_dir=work_dirs_vit/k400_vitlarge/debug"
I get the error below:
/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

FutureWarning,
Traceback (most recent call last):
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 564, in determine_local_world_size
return int(nproc_per_node)
ValueError: invalid literal for int() with base 10: '--validate'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 709, in run
config, cmd, cmd_args = config_from_args(args)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 617, in config_from_args
nproc_per_node = determine_local_world_size(args.nproc_per_node)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 582, in determine_local_world_size
raise ValueError(f"Unsupported nproc_per_node value: {nproc_per_node}")
ValueError: Unsupported nproc_per_node value: --validate

The command looks good to me. I am not sure why it is interpreting --validate as the number_of_gpus. Could you try the provided script run_exp.sh directly?
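
For anyone reading the traceback: the final ValueError says that the launcher received the string --validate where it expected a GPU count. In the command above, --validate is the fourth argument after the script name, while 3 is the second. One possible explanation, which is only an assumption because the contents of tools/dist_train.sh are not shown in this thread, is that the wrapper reads the GPU count from a different positional argument than the one the command supplies. The sketch below reproduces the message under that assumption. Both parse_nproc_per_node and gpu_count_argument are hypothetical helpers written for illustration; the first is a simplified stand-in for the integer check visible in the traceback, not PyTorch's actual code.

```python
def parse_nproc_per_node(value: str) -> int:
    """Simplified stand-in for the integer check seen in the traceback."""
    try:
        return int(value)
    except ValueError:
        # Same message format as the last line of the traceback.
        raise ValueError(f"Unsupported nproc_per_node value: {value}")


def gpu_count_argument(argv, gpus_index):
    """Return the token a wrapper would forward as the GPU count.

    gpus_index is a hypothetical zero-based position; which position the
    real wrapper script reads is an assumption, not taken from the repo.
    """
    return argv[gpus_index]


# Arguments from the issue, in the order given after "bash tools/dist_train.sh".
argv = [
    "configs/recognition/vit/vitclip_large_k400.py",
    "3",
    "--test-last",
    "--validate",
    "--cfg-options",
    "model.backbone.pretrained=openaiclip",
    "work_dir=work_dirs_vit/k400_vitlarge/debug",
]

# Reading the second argument yields a valid count.
print(parse_nproc_per_node(gpu_count_argument(argv, 1)))

# Reading the fourth argument reproduces the reported failure.
try:
    parse_nproc_per_node(gpu_count_argument(argv, 3))
except ValueError as err:
    print(err)
```

If this is what is happening, opening tools/dist_train.sh and checking which positional parameter is passed to --nproc_per_node should confirm it.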