ymcui/Chinese-LLaMA-Alpaca-3

Instruction fine-tuning fails on macOS (Apple M3 chip)

yaoyonstudio opened this issue · 3 comments

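The following items must be checked before submitting

  • Make sure you are using the latest code from the repository (git pull)
  • I have read the FAQ section of the project documentation and searched the existing issues, and found no similar problem or solution.
  • Third-party tool issues: for tools such as llama.cpp or text-generation-webui, look for a solution in the corresponding project first.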

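Issue type

Model training and fine-tuning

Base model

Llama-3-Chinese-8B-Instruct (instruction model)

Operating system

macOS

Detailed description of the problem

Instruction fine-tuning fails with an error on macOS (Apple M3 chip).

Dependencies (required for code-related issues)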

#!/bin/bash
## Read the wiki (Chinese version: https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/wiki/sft_scripts_zh) carefully before running the script
## Read the wiki(https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/wiki/sft_scripts_en) carefully before running the script
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/Users/ken/Devworld/ai/models/hfl/llama-3-chinese-8b-instruct-v2
tokenizer_name_or_path=${pretrained_model}
dataset_dir=/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/data
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=/Users/ken/Devworld/ai/models/hfl/llama-3-chinese-8b-instruct-v2
validation_file=/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/data/ost.json

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${tokenizer_name_or_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --low_cpu_mem_usage \
    --do_eval \
    --seed $RANDOM \
    --bf16 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype bfloat16 \
    --validation_file ${validation_file} \
    --load_in_kbits 8 \
    --ddp_find_unused_parameters False

Run logs or screenshots

/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/accelerate/state.py:313: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 14 to improve oob performance.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 457, in <module>
    main()
  File "/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 219, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 136, in __init__
  File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/transformers/training_args.py", line 1629, in __post_init__
    raise ValueError(
ValueError: BF16 Mixed precision training with AMP (`--bf16`) and BF16 half precision evaluation (`--bf16_full_eval`) can only be used on CUDA, XPU (with IPEX), NPU, MLU or CPU/TPU/NeuronCore devices.
[2024-05-28 17:00:42,881] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 38594) of binary: /Users/ken/anaconda3/envs/python-311/bin/python
Traceback (most recent call last):
  File "/Users/ken/anaconda3/envs/python-311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-28_17:00:42
  host      : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 38594)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

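This code is mainly intended for NVIDIA GPUs. On Apple M-series chips you need to change the device to MPS for it to have any chance of running (and other errors may still come up).
You could consider training the llama3 model with Apple's MLX framework; our model is a standard llama3 model.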
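
As a rough illustration of the device point above, here is a minimal sketch that reports which backend PyTorch can see before launching a training run. The helper name `detect_device_type` is hypothetical and not part of this repository; the only PyTorch calls it relies on are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. Falling back to "cpu" when PyTorch is missing is an assumption made so the snippet runs anywhere.

```python
def detect_device_type() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports.

    Hypothetical helper for illustration only; not part of the training script.
    """
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed we simply report CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # torch.backends.mps is absent in old PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(f"PyTorch would use: {detect_device_type()}")
```

If this prints `mps`, the `--bf16` check shown in the traceback above would be expected to reject the run, since MPS is not among the device types listed in that error message.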

ValueError: BF16 Mixed precision training with AMP (`--bf16`) and BF16 half precision evaluation (`--bf16_full_eval`) can only be used on CUDA, XPU (with IPEX), NPU, MLU or CPU/TPU/NeuronCore devices.

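The error message makes the problem quite clear. I also looked into it: as far as I could find, Apple Silicon chips so far (up to and including the M3 generation) do not support the BF16 format. The problem lies with MPS, i.e. the GPU, rather than with the Neural Engine (ANE/NPU) or the AMX matrix acceleration unit. The reason may be that Apple uses an older Arm instruction set in which BF16 support is optional rather than mandatory.

Possible solutions, as a guess:

  • Switch to FP16 for quantization
  • Train in full precision, although an M3 chip may not be able to run it; a chip with more unified memory would be needed, such as an M2 Ultra with 192 GB or an M3 Max with 128 GB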
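
To make the two suggestions above concrete, here is a small sketch that rewrites the argument list from the launch script so it no longer requests BF16. The helper `drop_bf16` is hypothetical. It assumes the script accepts `float16` or `float32` as values for `--torch_dtype`, which this thread does not confirm. Whether the resulting run fits in memory or works on MPS at all is also not confirmed here.

```python
def drop_bf16(args: list[str], dtype: str = "float32") -> list[str]:
    """Remove --bf16 and replace the value of --torch_dtype with `dtype`.

    Hypothetical helper; `dtype` would be "float16" for the FP16 suggestion
    or "float32" for full precision.
    """
    out: list[str] = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the old value that followed --torch_dtype; drop it.
            skip_next = False
            continue
        if arg == "--bf16":
            # BF16 mixed precision is what triggers the ValueError above.
            continue
        if arg == "--torch_dtype":
            out += ["--torch_dtype", dtype]
            skip_next = True
            continue
        out.append(arg)
    return out


if __name__ == "__main__":
    original = ["--do_train", "--bf16", "--torch_dtype", "bfloat16", "--seed", "1"]
    print(drop_bf16(original, dtype="float16"))
```

The same change can of course be made by hand in the shell script by deleting the `--bf16` line and editing the `--torch_dtype` line.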

I'll give the MLX framework a try, thanks.