Instruction fine-tuning fails on macOS (Apple M3 chip)
yaoyonstudio opened this issue · 3 comments
yaoyonstudio commented
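The following items must be checked before submitting
- Make sure you are using the latest code from the repository (git pull)
- I have read the project documentation and the FAQ section and searched the existing issues, and found no similar problem or solution.
- Third-party plugin problems: for tools such as llama.cpp and text-generation-webui, it is recommended to look for a solution in the corresponding project first.
Issue type
Model training and fine-tuning
Base model
Llama-3-Chinese-8B-Instruct (instruction model)
Operating system
macOS
Describe the problem in detail
Instruction fine-tuning fails on macOS (Apple M3 chip)
Dependencies (must be provided for code-related issues)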
#!/bin/bash
## Read the Chinese wiki (https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/wiki/sft_scripts_zh) carefully before running the script
## Read the wiki(https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/wiki/sft_scripts_en) carefully before running the script
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
pretrained_model=/Users/ken/Devworld/ai/models/hfl/llama-3-chinese-8b-instruct-v2
tokenizer_name_or_path=${pretrained_model}
dataset_dir=/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/data
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=/Users/ken/Devworld/ai/models/hfl/llama-3-chinese-8b-instruct-v2
validation_file=/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/data/ost.json
torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
--model_name_or_path ${pretrained_model} \
--tokenizer_name_or_path ${tokenizer_name_or_path} \
--dataset_dir ${dataset_dir} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--per_device_eval_batch_size ${per_device_eval_batch_size} \
--do_train \
--low_cpu_mem_usage \
--do_eval \
--seed $RANDOM \
--bf16 \
--num_train_epochs 3 \
--lr_scheduler_type cosine \
--learning_rate ${lr} \
--warmup_ratio 0.03 \
--logging_strategy steps \
--logging_steps 10 \
--save_strategy steps \
--save_total_limit 3 \
--evaluation_strategy steps \
--eval_steps 100 \
--save_steps 200 \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--preprocessing_num_workers 8 \
--max_seq_length ${max_seq_length} \
--output_dir ${output_dir} \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--lora_rank ${lora_rank} \
--lora_alpha ${lora_alpha} \
--trainable ${lora_trainable} \
--lora_dropout ${lora_dropout} \
--modules_to_save ${modules_to_save} \
--torch_dtype bfloat16 \
--validation_file ${validation_file} \
--load_in_kbits 8 \
--ddp_find_unused_parameters False
Run logs or screenshots
/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/accelerate/state.py:313: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 14 to improve oob performance.
warnings.warn(
Traceback (most recent call last):
File "/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 457, in <module>
main()
File "/Users/ken/Devworld/ai/llama3-chinese/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 219, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
obj = dtype(**inputs)
^^^^^^^^^^^^^^^
File "<string>", line 136, in __init__
File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/transformers/training_args.py", line 1629, in __post_init__
raise ValueError(
ValueError: BF16 Mixed precision training with AMP (`--bf16`) and BF16 half precision evaluation (`--bf16_full_eval`) can only be used on CUDA, XPU (with IPEX), NPU, MLU or CPU/TPU/NeuronCore devices.
[2024-05-28 17:00:42,881] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 38594) of binary: /Users/ken/anaconda3/envs/python-311/bin/python
Traceback (most recent call last):
File "/Users/ken/anaconda3/envs/python-311/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ken/anaconda3/envs/python-311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-28_17:00:42
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 38594)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ymcui commented
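This code is mainly intended for NVIDIA GPUs. On Apple M-series chips you need to change the device to MPS for it to have a chance of running (and other errors may still come up).
You could consider training the Llama 3 model with Apple's MLX framework; our model is a standard Llama 3 model.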
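As a rough illustration of the device switch mentioned above, here is a minimal sketch of how a script might prefer CUDA, then MPS, then CPU. The helpers `pick_device` and `detect_device` are hypothetical names and are not part of this repository's training script; the sketch only assumes that PyTorch may be installed, and selecting MPS this way does not by itself make the training script work on Apple Silicon.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

On an Apple Silicon Mac with a recent PyTorch build this would typically report `mps`, but the precision flags passed to the trainer still have to be compatible with that device.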
AdrianusWei commented
ValueError: BF16 Mixed precision training with AMP (`--bf16`) and BF16 half precision evaluation (`--bf16_full_eval`) can only be used on CUDA, XPU (with IPEX), NPU, MLU or CPU/TPU/NeuronCore devices.
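The error message states the problem clearly. I also looked into it: Apple Silicon chips so far (up to the M3 generation) do not support the BF16 format. The problem lies with MPS, that is, the GPU, rather than with the Apple Neural Engine (ANE/NPU) or the matrix acceleration unit (AMX). The cause may be that Apple uses an older Arm instruction set, in which BF16 support was optional rather than mandatory.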
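Possible solutions, as a guess:
- Switch to FP16 for quantization
- Train in full precision, although an M3 chip may not be able to handle it; this would need a chip with more unified memory, such as the M2 Ultra (192 GB) or the M3 Max (128 GB)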
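To put a number on the unified-memory concern, here is a back-of-the-envelope sketch of how much memory the model weights alone occupy at different precisions. The figure of 8e9 parameters is a round assumption for an 8B model, and `weight_memory_gib` is a hypothetical helper. Gradients, optimizer state, and activations come on top of this, so treat the result as a lower bound.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    num_params = 8e9  # assumption: round figure for an 8B-parameter model
    for dtype in ("float32", "float16"):
        print(f"{dtype}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```

Under that assumption the weights take roughly 30 GiB in full precision and roughly 15 GiB in FP16, which is why full-precision training is likely to need a machine with far more unified memory.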
yaoyonstudio commented
I'll give the MLX framework a try. Thanks.