hiyouga/LLaMA-Factory

Ascend 910B: error when running full-parameter fine-tuning with ChatGLM2

Closed this issue · 6 comments

Reminder

  • I have read the README and searched the existing issues.

Reproduction

The error is shown in the traceback below (pasted under Others).

Expected behavior

No response

System Info

No response

Others

[INFO|modeling_utils.py:4170] 2024-05-20 17:25:15,119 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[INFO|modeling_utils.py:4178] 2024-05-20 17:25:15,119 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /root/.cache/modelscope/hub/ZhipuAI/chatglm3-6b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:3719] 2024-05-20 17:25:15,124 >> Generation config file not found, using a generation config created from the model config.
05/20/2024 17:25:15 - INFO - llamafactory.model.utils.checkpointing - Gradient checkpointing enabled.
05/20/2024 17:25:15 - INFO - llamafactory.model.utils.attention - Using vanilla attention implementation.
05/20/2024 17:25:15 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
05/20/2024 17:25:15 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
05/20/2024 17:25:15 - INFO - llamafactory.model.loader - trainable params: 1949696 || all params: 6245533696 || trainable%: 0.0312
[INFO|trainer.py:626] 2024-05-20 17:25:15,521 >> Using auto half precision backend
[INFO|trainer.py:2048] 2024-05-20 17:25:15,881 >> ***** Running training *****
[INFO|trainer.py:2049] 2024-05-20 17:25:15,881 >> Num examples = 1,000
[INFO|trainer.py:2050] 2024-05-20 17:25:15,881 >> Num Epochs = 3
[INFO|trainer.py:2051] 2024-05-20 17:25:15,882 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2054] 2024-05-20 17:25:15,882 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2055] 2024-05-20 17:25:15,882 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2056] 2024-05-20 17:25:15,882 >> Total optimization steps = 186
[INFO|trainer.py:2057] 2024-05-20 17:25:15,884 >> Number of trainable parameters = 1,949,696
0%| | 0/186 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data/anaconda3/envs/llama_factory/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/data/LLaMA-Factory/src/llamafactory/cli.py", line 65, in main
    run_exp()
  File "/data/LLaMA-Factory/src/llamafactory/train/tuner.py", line 34, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 941, in forward
    transformer_outputs = self.transformer(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 834, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 631, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
    ret = function(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 408, in forward
    query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
NotImplementedError: Unknown device for graph fuser

@hunterhome ChatGLM uses torch.jit, which torch-npu does not support. You can comment out the corresponding torch.jit decorator.
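For reference, a minimal sketch of the edit, assuming the decorator sits directly above apply_rotary_pos_emb (the function called at line 408 of the traceback); the exact location varies by model revision, and the function body itself stays unchanged:

```python
# modeling_chatglm.py (sketch of the edit; line numbers vary by revision)
import torch

# @torch.jit.script  # commented out: torch-npu lacks the TorchScript graph fuser
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    ...  # keep the original function body unchanged
```

Without the decorator the function runs eagerly, which torch-npu can execute.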

One additional note: do not comment out the torch.jit decorator in the ~/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py file shown in the traceback. Instead, modify the corresponding file in the repo downloaded from ModelScope or Hugging Face; for ModelScope, for example, that is ~/.cache/modelscope/hub/ZhipuAI/chatglm3-6b/modeling_chatglm.py.
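If transformers does not pick up the edit automatically, deleting the cached remote-code copy forces it to be regenerated from the modified repo file on the next load. A minimal sketch, assuming the cache layout from the traceback above:

```python
# Delete the stale remote-code copy under transformers_modules so that
# transformers re-copies the edited modeling_chatglm.py on the next load.
import shutil
from pathlib import Path

stale = Path.home() / ".cache/huggingface/modules/transformers_modules/chatglm3-6b"
if stale.exists():
    shutil.rmtree(stale)
```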