open-mmlab/mmengine

[Bug] The distributed training example code throws an error

Closed this issue · 5 comments

Prerequisite

Environment

PyTorch 2.3, CUDA 12.3, GPU training

Reproduces the problem - code sample

https://github.com/open-mmlab/mmengine/blob/main/examples/llama2/fsdp_finetune.py
Modified to train the InternLM (书生) model:

# Prepare model for internlm2 by wuzhhui
model, tokenizer = build_model(
    model_name_or_path=args.checkpoint,
    return_tokenizer=True)

# Prepare model for llama
# tokenizer = LlamaTokenizer.from_pretrained(args.checkpoint)
# tokenizer.add_special_tokens({'pad_token': '<PAD>'})
# model = LlamaForCausalLM.from_pretrained(args.checkpoint)
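For context, build_model here is the reporter's own helper rather than part of mmengine. A minimal sketch of what such a helper might look like with HuggingFace Transformers (hypothetical; InternLM2 checkpoints ship custom modeling code, so trust_remote_code=True is required):

from transformers import AutoModelForCausalLM, AutoTokenizer

def build_model(model_name_or_path, return_tokenizer=False):
    # Hypothetical sketch of the reporter's helper: load an InternLM2
    # checkpoint via HuggingFace Transformers. InternLM2 uses custom
    # modeling code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path, trust_remote_code=True)
    if not return_tokenizer:
        return model
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path, trust_remote_code=True)
    return model, tokenizer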

Reproduces the problem - command or script

LOGLEVEL=DEBUG NPROC_PER_NODE=1 torchrun fsdp_finetune.py /models/instruct-finetrain.json /models/internlm2-1_8b --max-epoch 100 --save-interval 50 --output-dir ${work_dir}

Reproduces the problem - error message

RuntimeError: "amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
[rank0]: Traceback (most recent call last):
[rank0]: File "/models/internlm2-1_8b_fsdp_train/fsdp_finetune.py", line 185, in
[rank0]: train()
[rank0]: File "/models/internlm2-1_8b_fsdp_train/fsdp_finetune.py", line 161, in train
[rank0]: optimizer.update_params(loss)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 201, in update_params
[rank0]: self.step(**step_kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/scheduler/param_scheduler.py", line 115, in wrapper
[rank0]: return wrapped(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/optimizer/amp_optimizer_wrapper.py", line 137, in step
[rank0]: self.loss_scaler.unscale_(self.optimizer)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/sharded_grad_scaler.py", line 278, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/sharded_grad_scaler.py", line 243, in unscale_grads
[rank0]: torch._amp_foreach_non_finite_check_and_unscale_(
[rank0]: RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
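The failure happens inside FSDP's ShardedGradScaler when it unscales BFloat16 gradients. A minimal sketch that isolates the failing kernel named in the traceback (whether it actually raises depends on the PyTorch build and GPU):

import torch

# ShardedGradScaler.unscale_ calls this fused kernel on each device's
# gradients; on this setup it has no BFloat16 CUDA implementation.
grads = [torch.ones(4, device='cuda', dtype=torch.bfloat16)]
found_inf = torch.zeros(1, device='cuda')
inv_scale = torch.ones(1, device='cuda')
torch._amp_foreach_non_finite_check_and_unscale_(grads, found_inf, inv_scale)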

Additional information

Maybe an issue with the model or a library.

What GPU model are you using?

It's possible that your GPU does not support BFloat16 computation.
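A quick way to check, using standard PyTorch APIs:

import torch

# BF16 tensor cores require compute capability >= 8.0 (Ampere or newer);
# the V100 is Volta, so both checks should come back negative there.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (7, 0) on V100
print(torch.cuda.is_bf16_supported())       # False on V100 with this torch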

Also, if you want to fine-tune InternLM models, XTuner (https://github.com/InternLM/xtuner) is recommended.

GPU 1: Tesla V100-PCIE-32GB

The V100 indeed should not support BFloat16.
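On pre-Ampere cards such as the V100, a common workaround is to fall back to float16 mixed precision. A minimal sketch of picking the dtype and building an FSDP mixed-precision policy from it (how to wire this into fsdp_finetune.py's own config is left to the script):

import torch
from torch.distributed.fsdp import MixedPrecision

# Fall back to float16 on GPUs without native BF16 support.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

mp_policy = MixedPrecision(
    param_dtype=dtype,
    reduce_dtype=dtype,
    buffer_dtype=dtype,
)

The same choice should then be reflected in mmengine's AmpOptimWrapper (dtype='float16' instead of 'bfloat16'), so that the ShardedGradScaler unscales fp16 rather than bf16 gradients.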