The .sh script from "Evaluate the fine-tuned EVA (336px, patch_size=14) on ImageNet-1K val with a single node" cannot be executed

Question

peter-ni-noob opened this issue a year ago · 1 comments

(eva) root@nexus-nyz:~/EVA/EVA-01/eva# bash eva.sh
/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Traceback (most recent call last):
  File "/root/miniconda3/envs/eva/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/eva/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 212, in launch_agent
    master_addr, master_port = _get_addr_and_port(rdzv_parameters)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 167, in _get_addr_and_port
    master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 102, in parse_rendezvous_endpoint
    raise ValueError(
ValueError: The hostname of the rendezvous endpoint ':12355' must be a dot-separated list of labels, an IPv4 address, or an IPv6 address.
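
The endpoint ':12355' in the ValueError has an empty host part, which suggests that MASTER_ADDR was unset or empty when the launcher combined --master_addr and --master_port. Below is a minimal sketch of a guard, assuming a single-node run where the loopback address 127.0.0.1 is acceptable. The helper name build_endpoint is hypothetical and not part of the EVA repository.

```shell
# Hypothetical helper (not part of the EVA repo): fall back to defaults
# when the address or port is unset or empty.
build_endpoint() {
  addr="${1:-127.0.0.1}"   # assumption: single node, loopback is fine
  port="${2:-12355}"       # port taken from the error message
  printf '%s:%s\n' "$addr" "$port"
}

# Pass whatever MASTER_ADDR / MASTER_PORT currently hold, possibly nothing.
build_endpoint "${MASTER_ADDR:-}" "${MASTER_PORT:-}"
```

If the printed endpoint has a non-empty host before the colon, the launcher should at least get past the hostname check that raised the error.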

Answer 1 · 2023-09-22T05:07:21.000Z

MODEL_NAME=eva_g_patch14

sz=336
batch_size=16
crop_pct=1.0

EVAL_CKPT=/path/to/eva_21k_1k_336px_psz14_ema_89p6.pt # https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt

DATA_PATH=/data_gs/imagenet
NNODES=1
NODE_RANK=0
MASTER_ADDR=127.0.0.1
# One process per GPU (7 GPUs on this machine); trailing backslashes
# continue the command across lines.
python -m torch.distributed.launch --nproc_per_node=7 --nnodes=$NNODES --node_rank=$NODE_RANK \
        --master_addr=$MASTER_ADDR --master_port=12355 --use_env run_class_finetuning.py \
        --data_path ${DATA_PATH}/train \
        --eval_data_path ${DATA_PATH}/val \
        --nb_classes 1000 \
        --data_set image_folder \
        --model ${MODEL_NAME} \
        --finetune ${EVAL_CKPT} \
        --input_size ${sz} \
        --batch_size ${batch_size} \
        --crop_pct ${crop_pct} \
        --no_auto_resume \
        --dist_eval \
        --eval \
        --enable_deepspeed

The code above is eva.sh. I have one machine with 7 GPUs, and this is how I configured it. It now fails with a different error:

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
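
The NCCL error text points at NCCL warnings for the actual failure reason. One way to surface them is NCCL's NCCL_DEBUG environment variable, exported before rerunning the script. This is a minimal sketch, assuming the launch script is still named eva.sh. It only raises log verbosity and does not fix anything by itself.

```shell
# Ask NCCL to print informational and warning messages during startup.
export NCCL_DEBUG=INFO
# Optional: restrict the output to initialization and network setup.
export NCCL_DEBUG_SUBSYS=INIT,NET
echo "NCCL_DEBUG=${NCCL_DEBUG} NCCL_DEBUG_SUBSYS=${NCCL_DEBUG_SUBSYS}"
# Then rerun the evaluation, e.g.: bash eva.sh
```

The lines NCCL logs just before the failure usually name the system call or device that failed, which narrows down whether the cause is networking, shared memory, or a GPU.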