jianzhnie/LLamaTuner

Error during single-machine multi-GPU parallel training

Closed this issue · 3 comments

Error:
ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.
Training script:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train_lora.py \
    --model_name_or_path ./models/baichuan-13B-Base \
    --dataset_name alpaca \
    --data_dir data/alpaca.json \
    --load_from_local \
    --output_dir ./work_dir/baichuan-13b-wb-lora-ds \
    --lora_target_modules W_pack \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_total_limit 20 \
    --save_steps 500 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --source_max_len 512 \
    --target_max_len 512 \
    --lora_r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --trust_remote_code \
    --fp16 \
    --deepspeed "scripts/ds_config/ds_config_zero3_auto.json"

Commenting out device_map gets training running, but after about 200 steps it hits another error.
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    # device_map=device_map,  # commented out here
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=compute_dtype,
    ) if args.q_lora else None,
    torch_dtype=compute_dtype,
    **config_kwargs,
)
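Rather than commenting the argument out unconditionally, the device_map can be resolved at runtime: pass no device_map when the DeepSpeed config requests ZeRO-3 (which shards parameters itself and rejects a device_map), and otherwise pin each process to its own GPU. The helper below is a minimal sketch, not code from this repo; the function name `resolve_device_map` and the config-file argument are illustrative assumptions.

```python
import json
import os


def resolve_device_map(deepspeed_config_path=None):
    """Return a device_map for from_pretrained(), or None.

    Under DeepSpeed ZeRO-3 the model must be loaded without a
    device_map (and without low_cpu_mem_usage=True), because ZeRO-3
    shards parameters itself; passing one raises the ValueError
    reported above. Hypothetical helper for illustration only.
    """
    if deepspeed_config_path:
        with open(deepspeed_config_path) as f:
            ds_config = json.load(f)
        stage = ds_config.get("zero_optimization", {}).get("stage", 0)
        if stage == 3:
            return None  # let DeepSpeed place and shard the parameters
    # Otherwise pin this process to its local GPU (DDP-style launch):
    # torchrun sets LOCAL_RANK for each worker process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return {"": local_rank}
```

The returned value would then be passed as `device_map=resolve_device_map(args.deepspeed)` in the `from_pretrained` call, so the same script works with and without ZeRO-3.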

New error: torch.distributed.elastic.multiprocessing.api.SignalException: Process 258303 got signal: 1
(Signal 1 is SIGHUP, which usually means the launching terminal or ssh session was closed rather than the training code itself crashing.)

This looks like a deepspeed bug.


Found the problem: it is related to the deepspeed version. The versions that worked in my experiments:
accelerate 0.20.3
deepspeed 0.9.2
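To reproduce the working setup, the two packages can be pinned to the versions the reporter verified (a setup fragment, assuming a standard pip environment):

```shell
# Pin the combination reported to work in this thread.
pip install "accelerate==0.20.3" "deepspeed==0.9.2"
```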