PaddlePaddle/PaddleNLP

[Bug]: get_rank_by_dim_and_process_id method is not implemented

jazzly opened this issue · 0 comments

Software Environment

- paddlepaddle: 2.6.1
- paddlepaddle-gpu: 
- paddlenlp: 2.8.0

Duplicate Check

  • I have searched the existing issues

Bug Description

With the versions above, I trained on the sample data from the multi-class text classification example at https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme.
In CPU mode the default command trains single-threaded. To speed up training I noticed the enable_auto_parallel option, but when enable_auto_parallel is set to True, training fails at startup because the get_rank_by_dim_and_process_id method cannot be found.

Traceback (most recent call last):
  File "train.py", line 230, in <module>
    main()
  File "train.py", line 166, in main
    trainer = Trainer(
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 388, in __init__
    self.print_config()
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 3058, in print_config
    v = getattr(args, a)
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/training_args.py", line 1524, in data_parallel_rank
    return mesh.get_rank_by_dim_and_process_id("dp", dist.get_rank())
AttributeError: 'ProcessMesh' object has no attribute 'get_rank_by_dim_and_process_id'
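Until the installed Paddle build provides the method, a defensive guard at the call site avoids the crash. Below is a minimal sketch of that pattern; the `ProcessMesh` class here is a hypothetical stand-in (the real one lives in paddle.distributed), and the rank-0 fallback is an assumption matching the single-process CPU case:

```python
# Hypothetical stand-in that, like the installed paddle 2.6.1 class,
# lacks get_rank_by_dim_and_process_id.
class ProcessMesh:
    pass


def data_parallel_rank(mesh, global_rank):
    # Guard the failing call from training_args.py: fall back to rank 0
    # (single-process default) when the method is missing on this build.
    fn = getattr(mesh, "get_rank_by_dim_and_process_id", None)
    if fn is None:
        return 0
    return fn("dp", global_rank)


print(data_parallel_rank(ProcessMesh(), 0))  # → 0
```

This only papers over the AttributeError so `print_config` can run; it does not make auto parallel itself work on a build that lacks the method.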

Steps to Reproduce & Code

Start training with the enable_auto_parallel flag enabled:
python3 train.py \
    --do_train \
    --do_eval \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device cpu \
    --num_train_epochs 100 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 3 \
    --enable_auto_parallel True
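As a side note on the original goal (faster CPU training without auto parallel): CPU throughput can often be raised by setting threading environment variables before paddle is imported. A hedged sketch — `CPU_NUM` and `OMP_NUM_THREADS` are real Paddle/OpenMP knobs, but whether they speed up this particular script is an assumption:

```python
import os

# Assumption: raising these before importing paddle lets the CPU runtime
# use more places/threads. CPU_NUM sets Paddle's CPU place count;
# OMP_NUM_THREADS sets intra-op OpenMP/MKL threads. The speedup for
# train.py specifically is unverified.
os.environ.setdefault("CPU_NUM", "4")
os.environ.setdefault("OMP_NUM_THREADS", "4")
```

Placing these lines (or the equivalent shell exports) at the very top of train.py, before any paddle import, may be a simpler workaround than enable_auto_parallel on CPU.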