PaddlePaddle/PaddleNLP

[Question]: Training error with the uie-base model on an Ascend (NPU) server


Please describe your question

The NPU build of the Paddle framework passes the install check:
```
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
I0430 15:37:53.522773 32875 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60423
I0430 15:38:17.834956 32959 tcp_store.cc:293] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 8 npus.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
```
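For reference, the two success lines above are what `paddle.utils.run_check()` prints when the multi-card check passes; a minimal sketch for re-running that check on the NPU build (assuming the PaddleCustomDevice plugin registers the device type as `npu`) is:

```python
import paddle

# With the Ascend plugin installed this list should include 'npu'.
print(paddle.device.get_all_custom_device_type())

# Prints "PaddlePaddle works well on N npus." and
# "PaddlePaddle is installed successfully! ..." when the check passes, matching the log above.
paddle.utils.run_check()
```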

**Code branch:** develop
**Docker image used to build the Paddle framework:** registry.baidubce.com/device/paddle-npu:cann80T2-910B-ubuntu18-aarch64
npu-smi info:
```
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.0                    Version: 23.0.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)   Temp(C)            Hugepages-Usage(page) |
| Chip                      | Bus-Id        | AICore(%)  Memory-Usage(MB)   HBM-Usage(MB)         |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 94.4       39                 0    / 0              |
| 0                         | 0000:C1:00.0  | 0          0    / 0           3315 / 65536          |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 91.6       37                 0    / 0              |
| 0                         | 0000:C2:00.0  | 0          0    / 0           3315 / 65536          |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 92.3       38                 0    / 0              |
| 0                         | 0000:81:00.0  | 0          0    / 0           3315 / 65536          |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 92.6       39                 0    / 0              |
| 0                         | 0000:82:00.0  | 0          0    / 0           3315 / 65536          |
+===========================+===============+====================================================+
```

Model training error log:
```
Traceback (most recent call last):
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 262, in <module>
    main()
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 193, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 888, in train
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 1024, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 2544, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in distributed_concat
    output_tensors = [t if len(t.shape) > 0 else t.reshape([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in <listcomp>
    output_tensors = [t if len(t.shape) > 0 else t.reshape([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/opt/py39/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/utils/inplace_utils.py", line 45, in __impl__
    return func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 4635, in reshape
    out = _C_ops.reshape(x, shape)
OSError: (External) ACL error, the error code is : 100000. (at /work/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:223)
```
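Reading the stack, the failing call is the `t.reshape([-1])` that `distributed_concat` applies to the scalar (0-d) loss tensor before gathering it across cards, and the ACL error is raised from the NPU reshape kernel in PaddleCustomDevice. A minimal sketch that isolates just that reshape on a single NPU card (assumption: the plugin registers the device as `npu` and `paddle.set_device("npu")` works on this build; it may or may not reproduce error code 100000 outside the distributed run):

```python
import paddle

# Assumption: PaddleCustomDevice NPU plugin installed; device name registered as "npu".
paddle.set_device("npu")

# The Trainer accumulates the loss as a 0-d (scalar) tensor; distributed_concat reshapes it
# to 1-d before all_gather, which is the call site in the traceback above.
tr_loss = paddle.to_tensor(0.5)   # shape [] on recent Paddle versions (0-d tensor)
flat = tr_loss.reshape([-1])      # the reshape that raised "ACL error ... 100000" in the report
print(flat.shape)                 # expected [1] when the NPU reshape kernel handles 0-d input
```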

Launch script:
```bash
#!/bin/bash
export finetuned_model=./checkpoint/model_best
nohup python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py \
    --device gpu --logging_steps 10 --save_steps 100 --eval_steps 100 --seed 42 \
    --model_name_or_path uie-base --output_dir $finetuned_model \
    --train_path data/train.txt --dev_path data/dev.txt --max_seq_length 512 \
    --per_device_eval_batch_size 21 --per_device_train_batch_size 32 \
    --num_train_epochs 50 --learning_rate 1e-2 \
    --label_names "start_positions" "end_positions" \
    --do_train --do_eval --do_export --export_model_dir $finetuned_model \
    --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 \
    --load_best_model_at_end True --save_total_limit 1 >nohup.out &
```
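One detail worth flagging given the Ascend box: the command above selects CUDA cards (`--gpus "0,1,2,3"`) and passes `--device gpu`. A hedged variant that targets the NPUs instead is sketched below; it assumes `paddle.distributed.launch` on this build supports the generic `--devices` flag and that this PaddleNLP branch accepts `--device npu`, and it is a guess at the intended invocation rather than a confirmed fix for the ACL error.

```bash
#!/bin/bash
# Sketch only (not a confirmed fix): assumes the --devices flag of paddle.distributed.launch
# and the --device npu choice of PaddleNLP's TrainingArguments are available on this branch.
export finetuned_model=./checkpoint/model_best
nohup python -u -m paddle.distributed.launch --devices "0,1,2,3" finetune.py \
    --device npu --logging_steps 10 --save_steps 100 --eval_steps 100 --seed 42 \
    --model_name_or_path uie-base --output_dir $finetuned_model \
    --train_path data/train.txt --dev_path data/dev.txt --max_seq_length 512 \
    --per_device_eval_batch_size 21 --per_device_train_batch_size 32 \
    --num_train_epochs 50 --learning_rate 1e-2 \
    --label_names "start_positions" "end_positions" \
    --do_train --do_eval --do_export --export_model_dir $finetuned_model \
    --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 \
    --load_best_model_at_end True --save_total_limit 1 >nohup.out &
```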