Fine-tuning in Colab fails with: CUDA out of memory
chenmonster opened this issue · 11 comments
The following items must be checked before submitting
- Make sure you are using the latest code from this repository (git pull)
- I have read the project documentation and the FAQ, and searched existing issues without finding a similar problem or solution.
- Issues with third-party tools (e.g. llama.cpp, text-generation-webui) should first be reported to the corresponding projects.
Issue type
Model training and fine-tuning
Base model
Llama-3-Chinese-8B-Instruct (instruction model)
Operating system
Linux
Describe the problem in detail
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
pretrained_model=hfl/llama-3-chinese-8b-instruct-v2
tokenizer_name_or_path=${pretrained_model}
dataset_dir=./datasets--kigner--ruozhiba-llama3-tt/snapshots/2400d68db1bed109395e7470a6d9910581b21200
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=output_dir
validation_file=validation_file_name
torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
--model_name_or_path ${pretrained_model} \
--tokenizer_name_or_path ${tokenizer_name_or_path} \
--dataset_dir ${dataset_dir} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--per_device_eval_batch_size ${per_device_eval_batch_size} \
--do_train \
--low_cpu_mem_usage \
--seed $RANDOM \
--num_train_epochs 3 \
--lr_scheduler_type cosine \
--learning_rate ${lr} \
--warmup_ratio 0.03 \
--logging_strategy steps \
--logging_steps 10 \
--save_strategy steps \
--save_total_limit 3 \
--evaluation_strategy steps \
--eval_steps 100 \
--save_steps 200 \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--preprocessing_num_workers 8 \
--max_seq_length ${max_seq_length} \
--output_dir ${output_dir} \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--lora_rank ${lora_rank} \
--lora_alpha ${lora_alpha} \
--trainable ${lora_trainable} \
--lora_dropout ${lora_dropout} \
--modules_to_save ${modules_to_save} \
--torch_dtype float16 \
--load_in_kbits 4 \
--ddp_find_unused_parameters False
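For context, a back-of-the-envelope estimate (an editorial sketch, not part of the original report; it assumes standard Llama-3-8B dimensions of vocab 128256 and hidden size 4096, and fp32 weights, gradients, and Adam states for the fully trained modules): with modules_to_save="embed_tokens,lm_head", both matrices are trained in full, so on top of the 4-bit base weights the run must hold full-precision copies plus optimizer state for them.

```python
# Rough footprint of fully training embed_tokens and lm_head
# (assumes Llama-3-8B sizes: vocab 128256, hidden 4096; upper bound with
# fp32 weight + fp32 gradient + fp32 Adam m/v per parameter).
vocab, hidden = 128256, 4096
params = 2 * vocab * hidden                 # embed_tokens + lm_head
bytes_per_param = 4 + 4 + 4 + 4             # weight, grad, Adam m, Adam v
print(f"{params * bytes_per_param / 2**30:.1f} GiB")   # ~15.7 GiB, more than a 14.75 GiB T4
```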
Dependencies (required for code-related issues)
bitsandbytes 0.43.1
peft 0.7.1
sentencepiece 0.1.99
torch 2.2.1+cu121
torchaudio 2.2.1+cu121
torchdata 0.7.1
torchsummary 1.5.1
torchtext 0.17.1
torchvision 0.17.1+cu121
transformers 4.40.2
Logs or screenshots
[INFO|modeling_utils.py:4178] 2024-05-16 03:58:24,735 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at hfl/llama-3-chinese-8b-instruct-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:883] 2024-05-16 03:58:24,835 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--hfl--llama-3-chinese-8b-instruct-v2/snapshots/15cfcd776b55047b601bf6635052f059ca754ded/generation_config.json
[INFO|configuration_utils.py:928] 2024-05-16 03:58:24,835 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": [
128001,
128009
],
"max_length": 4096,
"temperature": 0.6,
"top_p": 0.9
}
05/16/2024 03:58:25 - INFO - __main__ - Model vocab size: 128256
05/16/2024 03:58:25 - INFO - __main__ - len(tokenizer):128256
05/16/2024 03:58:25 - INFO - __main__ - Init new peft model
05/16/2024 03:58:25 - INFO - __main__ - target_modules: ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj']
05/16/2024 03:58:25 - INFO - __main__ - lora_rank: 64
Traceback (most recent call last):
File "/content/run_clm_sft_with_peft.py", line 439, in <module>
main()
File "/content/run_clm_sft_with_peft.py", line 391, in main
model = get_peft_model(model, peft_config)
File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 133, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1043, in __init__
super().__init__(model, peft_config, adapter_name)
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 126, in __init__
self.set_additional_trainable_modules(peft_config, adapter_name)
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 631, in set_additional_trainable_modules
_set_trainable(self, adapter_name)
File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 276, in _set_trainable
target.update(adapter_name)
File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 190, in update
self.modules_to_save.update(torch.nn.ModuleDict({adapter_name: copy.deepcopy(self.original_module)}))
File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.10/copy.py", line 297, in _reconstruct
value = deepcopy(value, memo)
File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
y = copier(memo)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parameter.py", line 59, in __deepcopy__
result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 14.75 GiB of which 695.06 MiB is free. Process 153578 has 14.07 GiB memory in use. Of the allocated memory 13.89 GiB is allocated by PyTorch, and 64.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-05-16 03:58:32,295] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 12197) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-16_03:58:32
host : 9af8a5d71495
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 12197)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Which GPU is running out of memory?
Tesla T4
Your launch script sets modules_to_save="embed_tokens,lm_head"; these two modules are trained in full rather than via LoRA.
Try setting it to None and see whether training can start.
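For reference (an editorial note, not part of the original comment): the traceback above shows where this bites. peft deep-copies every module listed in modules_to_save (copy.deepcopy(self.original_module) in peft/utils/other.py), and the failed 1.96 GiB allocation matches exactly one extra fp32 copy of a 128256 × 4096 matrix, i.e. embed_tokens or lm_head, presumably upcast to fp32 by the k-bit training preparation.

```python
# Size of the extra copy of embed_tokens / lm_head that peft's modules_to_save creates
# (vocab 128256 from the log above, hidden 4096 for Llama-3-8B, assuming fp32).
vocab, hidden = 128256, 4096
print(f"{vocab * hidden * 4 / 2**30:.2f} GiB")   # 1.96 GiB -- the exact allocation that failed
```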
After removing that parameter I still get the same error.
Did you restart the runtime? Make sure the GPU memory is cleared before running again.
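One quick way to confirm the card is actually empty before relaunching (a minimal editorial sketch; run it in a fresh Colab cell):

```python
import torch

# Returns (free, total) in bytes for the current CUDA device.
free, total = torch.cuda.mem_get_info()
print(f"free {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
```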
This ran end-to-end on a Colab T4 just yesterday (with modules_to_save=None), so please double-check your own setup.
Alternatively, any other tool that supports Llama-3 fine-tuning will work as well.
It runs normally on a V100-32GB GPU on the AutoDL AI compute cloud.
How do I convert the trained model to GGUF format afterwards?
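For reference, the usual route (a hedged sketch, not something confirmed in this thread; the adapter path below is a placeholder and the llama.cpp conversion script name varies by version) is to merge the LoRA adapter back into the base model with peft, save it as a regular Hugging Face checkpoint, then convert that directory with llama.cpp's HF-to-GGUF conversion script.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "hfl/llama-3-chinese-8b-instruct-v2", torch_dtype=torch.float16
)
# Placeholder path -- point this at the LoRA adapter directory your training run produced.
model = PeftModel.from_pretrained(base, "output_dir/sft_lora_model")
merged = model.merge_and_unload()            # fold the LoRA weights into the base weights
merged.save_pretrained("merged-hf-model")
AutoTokenizer.from_pretrained("hfl/llama-3-chinese-8b-instruct-v2").save_pretrained("merged-hf-model")
# "merged-hf-model" can then be converted with llama.cpp's convert-hf-to-gguf.py
# (the exact script name depends on your llama.cpp version).
```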
When the default system prompt DEFAULT_SYSTEM_PROMPT is changed to something longer, training also fails with a CUDA out of memory error.
Can a Colab T4 handle --load_in_kbits 8? Will it run out of memory?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.