codefuse-ai/MFTCoder

NCCL error


This is my command: accelerate launch --config_file accelerate_ds_config.yaml mft_accelerate.py --train_config configs/llama_train_config.json
This is the error I get:
build_tokenizer PAD id: 0, EOD id: 2
build_tokenizer PAD token : <unk>, EOD token: </s>

padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
['/home/project/MFTCoder/mft_peft_hf/src/pefts/data']
data splits: [95.0, 5.0, 0.0]
[Global Rank 0] open file /home/project/MFTCoder/mft_peft_hf/src/pefts/data/CodeExercise-Python-27k.json
[Global Rank 0]shape of cur train dataset: (4065, 4097)
[Global Rank 0]shape of cur valid dataset: (213, 4097)
[Global Rank 0]num tokens: [14585319]
[Global Rank 0]effective token rate: [0.8323676709874649]
train loss weights in rank 0: [1.0]
valid loss weights in rank 0: [1.0]
common denomination factor for CE loss in rank 0: 1
train sample weights in rank 0: [1.0]
valid sample weights in rank 0: [1.0]
Traceback (most recent call last):
File "/home/project/MFTCoder/mft_peft_hf/src/pefts/mft_accelerate.py", line 385, in
main()
File "/home/project/MFTCoder/mft_peft_hf/src/pefts/mft_accelerate.py", line 249, in main
train_dataset, valid_dataset = load_dataset_from_jsonl(args, shard_data=True, world_size=args.world_size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/project/MFTCoder/mft_peft_hf/src/pefts/../data/gpt2_multi_task_dataset.py", line 318, in load_dataset_from_jsonl
torch.distributed.barrier()
File "/home/miniconda3/envs/aps/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
work = default_pg.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1682343995622/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Bootstrap : no socket interface found
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3507579) of binary: /home/miniconda3/envs/aps/bin/python
Traceback (most recent call last):
File "/home/lqqq/.local/bin/accelerate", line 8, in
sys.exit(main())
^^^^^^
File "/home/lqqq/.local/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/lqqq/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 964, in launch_command
deepspeed_launcher(args)
File "/home/lqqq/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
distrib_run.run(args)
File "/home/miniconda3/envs/aps/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/miniconda3/envs/aps/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/miniconda3/envs/aps/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
mft_accelerate.py FAILED


Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-12-25_14:08:59
host : ks-gpu-7
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3507579)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

These are my settings:
accelerate_ds_config.yaml:
compute_environment: LOCAL_MACHINE
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: cpu
offload_param_device: none
zero3_init_flag: false
zero3_save_16bit_model: true
zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
++++++++++++++++++++++
llama_train_config.json:
{
"load_raw_dataset": true,
"data_paths": " /slurmhome/aps/libianbian/MFTCoder/mft_peft_hf/src/pefts/data ",
"output_dir": "./output_dir",
"tb_dir": "./tb_dir",
"pretrained_model_path": "/slurmhome/aps/pretrain_models/llama-7b-hf",
"vocab_file": "/slurmhome/aps/pretrain_models/llama-7b-hf",
"low_cpu_mem_usage": true,
"data_split": "95,5,0",
"padding_mode": "pack",
"tokenize_mode": "sft",
"weighted_loss_mode": "case3",
"model_type": "llama",
"peft_type": "qlora",
"quantization": "4bit",
"lora_rank": 32,
"lora_alpha": 32,
"lora_dropout": 0.05,
"per_device_train_batch_size": 2,
"per_device_eval_batch_size": 2,
"tokenizer_type": "AutoTokenizer",
"learning_rate": 1e-04,
"min_lr": 1e-5,
"weight_decay": 0.1,
"gradient_accumulation_steps": 1,
"lr_scheduler_type": "cosine",
"num_warmup_steps": 300,
"num_train_epochs": 8,
"seed": 1234,
"seq_length": 4096,
"resume_from_checkpoint": null,
"log_interval": 10,
"checkpointing_steps": 1000,
"evalation_steps": 1000,
"max_train_steps": null,
"epoch_checkpointing": true,
"shuffle_before_split": true,
"use_random_sampler": true,
"early_stopping": true,
"early_stopping_stall_num": 5,
"weight_by_num_documents": true,
"make_vocab_size_divisible_by": 128,
"model_parallel_size": 1,
"use_slow_tokenizer": false,
"use_xformers": true,
"trust_remote_code": true,
"use_dynamic_padding": true,
"world_size": 8
}

I am training on a single machine with 8 GPUs.

accelerate_ds_config.yaml:

num_machines: 1
num_processes: 1

You are training on a single machine with 8 GPUs, so this needs to be changed to num_processes: 8 (see the snippet below).
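
For reference, the relevant part of accelerate_ds_config.yaml after the change looks like this (all other fields stay exactly as you posted them):

    num_machines: 1
    num_processes: 8  # one process per GPU on the single machine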

File "/slurmhome/aps/libianbian/Coder/mft_peft_hf/src/pefts/../data/gpt2_multi_task_dataset.py", line 318, in load_dataset_from_jsonl
torch.distributed.barrier()
File "/slurmhome/aps/miniconda3/envs/aps/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
work = default_pg.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1682343995622/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Bootstrap : no socket interface found
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3710366 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3710367 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3710368 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3710369 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3710370 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3710371 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3710372 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3710365) of binary: /slurmhome/aps/miniconda3/envs/aps/bin/python

After changing the parameter above, I still get the same error.

The error message indicates that the NCCL backend used by PyTorch for distributed training has encountered an internal error during the initialization phase. The specific error is Bootstrap : no socket interface found, which suggests that NCCL was unable to find a suitable network interface for communication.

However, I'm not very familiar with NCCL internals, sorry about that. Here are some suggestions from GPT-4 which I hope are helpful to you.

  1. Set NCCL_SOCKET_IFNAME: Explicitly specify the network interface that NCCL should use by setting the NCCL_SOCKET_IFNAME environment variable. Set it to the name of the network interface you want to use for GPU communication; you can list the available interfaces by running ifconfig or ip addr on the command line (see the example after this list).

  2. Check for NCCL Compatibility: Make sure you are using a version of NCCL that is compatible with your version of PyTorch.
     - Update Software: Updating the versions of PyTorch, NCCL, and CUDA might resolve compatibility and bug issues.
     - Enable NCCL Debugging: Increase the verbosity of NCCL's logs to get more detailed information about the problem by setting NCCL_DEBUG=INFO or NCCL_DEBUG=WARN.

  3. Check the Network Environment: Ensure that all nodes in your distributed setting are on the same network and that the network configuration allows them to communicate without issues.

  4. Check Firewall and Security Groups: Make sure that any firewalls or security groups are configured to allow traffic on the ports that NCCL uses.

  5. Use a Different Backend: If you continue to have issues with NCCL, you can try using a different backend for distributed training, such as Gloo, though it might not offer the same performance benefits as NCCL for GPU-based training.
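
For example, suggestions 1 and 2 can be combined into your launch like this (a sketch; eth0 is just a placeholder, replace it with an interface name actually shown by ifconfig / ip addr on your machine):

    export NCCL_SOCKET_IFNAME=eth0   # your real interface; on a single node, lo (loopback) is also worth trying
    export NCCL_DEBUG=INFO           # verbose NCCL logs to see why the bootstrap fails
    accelerate launch --config_file accelerate_ds_config.yaml mft_accelerate.py --train_config configs/llama_train_config.json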

You can also first check whether NCCL and PyTorch themselves are working properly: take a simple DDP program and run it as a sanity check, for example the sketch below.
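
Something like this minimal all-reduce test (a sketch; the file name ddp_check.py and the torchrun launch are just examples, assuming one node with 8 GPUs) exercises the same NCCL initialization path without any MFTCoder code:

    # ddp_check.py -- minimal NCCL/DDP sanity check (hypothetical file name)
    import os
    import torch
    import torch.distributed as dist

    def main():
        # torchrun sets LOCAL_RANK, RANK and WORLD_SIZE for every process
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the environment
        dist.init_process_group(backend="nccl")

        # each rank contributes its rank id; after all_reduce every rank should
        # hold 0 + 1 + ... + (world_size - 1)
        t = torch.tensor([float(dist.get_rank())], device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)

        world_size = dist.get_world_size()
        expected = world_size * (world_size - 1) / 2
        print(f"rank {dist.get_rank()}: got {t.item()}, expected {expected}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Run it with torchrun --nproc_per_node=8 ddp_check.py. If it fails with the same "Bootstrap : no socket interface found" error, the problem is in your NCCL / network setup rather than in MFTCoder; if it passes, the issue is more likely in how the training job is launched.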