open-mmlab/mmengine

[Bug] The distributed training example code throws an error

Closed this issue · 5 comments

Prerequisite

Environment

PyTorch 2.3, CUDA 12.3, GPU training

Reproduces the problem - code sample

https://github.com/open-mmlab/mmengine/blob/main/examples/llama2/fsdp_finetune.py
Modified to train the InternLM (书生) model:

# Prepare model for internlm2 by wuzhhui
model, tokenizer = build_model(
    model_name_or_path=args.checkpoint,
    return_tokenizer=True)

# Prepare model for llama
# tokenizer = LlamaTokenizer.from_pretrained(args.checkpoint)
# tokenizer.add_special_tokens({'pad_token': '<PAD>'})
# model = LlamaForCausalLM.from_pretrained(args.checkpoint)
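For context, build_model here is the reporter's own helper rather than part of mmengine. A minimal sketch of what such a helper might look like with HuggingFace Transformers (hypothetical; InternLM2 checkpoints ship custom modeling code, so trust_remote_code=True is required):

from transformers import AutoModelForCausalLM, AutoTokenizer

def build_model(model_name_or_path, return_tokenizer=False):
    # Hypothetical sketch of the reporter's helper: load an InternLM2
    # checkpoint via HuggingFace Transformers. InternLM2 uses custom
    # modeling code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path, trust_remote_code=True)
    if not return_tokenizer:
        return model
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path, trust_remote_code=True)
    return model, tokenizer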

Reproduces the problem - command or script

LOGLEVEL=DEBUG NPROC_PER_NODE=1 torchrun fsdp_finetune.py /models/instruct-finetrain.json /models/internlm2-1_8b --max-epoch 100 --save-interval 50 --output-dir ${work_dir}

Reproduces the problem - error message

RuntimeError: "amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
[rank0]: Traceback (most recent call last):
[rank0]: File "/models/internlm2-1_8b_fsdp_train/fsdp_finetune.py", line 185, in
[rank0]: train()
[rank0]: File "/models/internlm2-1_8b_fsdp_train/fsdp_finetune.py", line 161, in train
[rank0]: optimizer.update_params(loss)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 201, in update_params
[rank0]: self.step(**step_kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/scheduler/param_scheduler.py", line 115, in wrapper
[rank0]: return wrapped(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/optimizer/amp_optimizer_wrapper.py", line 137, in step
[rank0]: self.loss_scaler.unscale_(self.optimizer)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/sharded_grad_scaler.py", line 278, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/sharded_grad_scaler.py", line 243, in unscale_grads
[rank0]: torch._amp_foreach_non_finite_check_and_unscale_(
[rank0]: RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
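The failure happens inside FSDP's ShardedGradScaler when it unscales BFloat16 gradients. A minimal sketch that isolates the failing kernel named in the traceback (whether it actually raises depends on the PyTorch build and GPU):

import torch

# ShardedGradScaler.unscale_ calls this fused kernel on each device's
# gradients; on this setup it has no BFloat16 CUDA implementation.
grads = [torch.ones(4, device='cuda', dtype=torch.bfloat16)]
found_inf = torch.zeros(1, device='cuda')
inv_scale = torch.ones(1, device='cuda')
torch._amp_foreach_non_finite_check_and_unscale_(grads, found_inf, inv_scale)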

Additional information

Maybe an issue with the model or a library.

What GPU model are you using?

It's possible that your GPU does not support BFloat16 computation.
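A quick way to check, using standard PyTorch APIs:

import torch

# BF16 tensor cores require compute capability >= 8.0 (Ampere or newer);
# the V100 is Volta, so both checks should come back negative there.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (7, 0) on V100
print(torch.cuda.is_bf16_supported())       # False on V100 with this torch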

Also, if you want to fine-tune InternLM models, XTuner (https://github.com/InternLM/xtuner) is recommended.

GPU 1: Tesla V100-PCIE-32GB

The V100 indeed should not support BFloat16.
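On pre-Ampere cards such as the V100, a common workaround is to fall back to float16 mixed precision. A minimal sketch of picking the dtype and building an FSDP mixed-precision policy from it (how to wire this into fsdp_finetune.py's own config is left to the script):

import torch
from torch.distributed.fsdp import MixedPrecision

# Fall back to float16 on GPUs without native BF16 support.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

mp_policy = MixedPrecision(
    param_dtype=dtype,
    reduce_dtype=dtype,
    buffer_dtype=dtype,
)

The same choice should then be reflected in mmengine's AmpOptimWrapper (dtype='float16' instead of 'bfloat16'), so that the ShardedGradScaler unscales fp16 rather than bf16 gradients.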