ymcui/Chinese-LLaMA-Alpaca-3

Fine-tuning in Colab fails with: CUDA out of memory

chenmonster opened this issue · 11 comments

Checklist before submitting

  • Make sure you are using the latest code from this repository (git pull).
  • I have read the FAQ section of the project documentation and searched existing issues; no similar problem or solution was found.
  • For issues with third-party components such as llama.cpp or text-generation-webui, please look for a solution in the corresponding project first.

Issue type

Model training and fine-tuning

Base model

Llama-3-Chinese-8B-Instruct (instruction model)

Operating system

Linux

Detailed description of the problem

lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=hfl/llama-3-chinese-8b-instruct-v2
tokenizer_name_or_path=${pretrained_model}
dataset_dir=./datasets--kigner--ruozhiba-llama3-tt/snapshots/2400d68db1bed109395e7470a6d9910581b21200
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=output_dir
validation_file=validation_file_name

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${tokenizer_name_or_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --low_cpu_mem_usage \
    --seed $RANDOM \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --load_in_kbits 4 \
    --ddp_find_unused_parameters False
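
For context on the settings above: with a single GPU, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps, which a quick check (numbers taken from the script) confirms:

# Effective batch size per optimizer step on one GPU:
# per_device_train_batch_size (1) × gradient_accumulation_steps (8) × n_gpus (1)
echo $(( 1 * 8 * 1 ))   # -> 8 sequences of up to max_seq_length=512 tokens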

Dependencies (required for code-related issues)

bitsandbytes                     0.43.1
peft                             0.7.1
sentencepiece                    0.1.99
torch                            2.2.1+cu121
torchaudio                       2.2.1+cu121
torchdata                        0.7.1
torchsummary                     1.5.1
torchtext                        0.17.1
torchvision                      0.17.1+cu121
transformers                     4.40.2

Run log or screenshots

Screenshot: 运行报错.png (run error)

[INFO|modeling_utils.py:4178] 2024-05-16 03:58:24,735 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at hfl/llama-3-chinese-8b-instruct-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:883] 2024-05-16 03:58:24,835 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--hfl--llama-3-chinese-8b-instruct-v2/snapshots/15cfcd776b55047b601bf6635052f059ca754ded/generation_config.json
[INFO|configuration_utils.py:928] 2024-05-16 03:58:24,835 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128009
  ],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

05/16/2024 03:58:25 - INFO - __main__ - Model vocab size: 128256
05/16/2024 03:58:25 - INFO - __main__ - len(tokenizer):128256
05/16/2024 03:58:25 - INFO - __main__ - Init new peft model
05/16/2024 03:58:25 - INFO - __main__ - target_modules: ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj']
05/16/2024 03:58:25 - INFO - __main__ - lora_rank: 64
Traceback (most recent call last):
  File "/content/run_clm_sft_with_peft.py", line 439, in <module>
    main()
  File "/content/run_clm_sft_with_peft.py", line 391, in main
    model = get_peft_model(model, peft_config)
  File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 133, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1043, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 126, in __init__
    self.set_additional_trainable_modules(peft_config, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 631, in set_additional_trainable_modules
    _set_trainable(self, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 276, in _set_trainable
    target.update(adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 190, in update
    self.modules_to_save.update(torch.nn.ModuleDict({adapter_name: copy.deepcopy(self.original_module)}))
  File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.10/copy.py", line 297, in _reconstruct
    value = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parameter.py", line 59, in __deepcopy__
    result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 14.75 GiB of which 695.06 MiB is free. Process 153578 has 14.07 GiB memory in use. Of the allocated memory 13.89 GiB is allocated by PyTorch, and 64.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-05-16 03:58:32,295] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 12197) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-16_03:58:32
  host      : 9af8a5d71495
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 12197)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
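
An observation on the traceback: the OOM is raised inside copy.deepcopy while peft's _set_trainable makes full-precision copies of the modules listed in modules_to_save. The failed 1.96 GiB allocation is consistent with a single fp32 copy of the embedding matrix; a quick check, using the vocab size from the log (128256) and Llama-3-8B's hidden size (4096):

# One fp32 copy of embed_tokens: vocab_size × hidden_size × 4 bytes.
# This matches the "Tried to allocate 1.96 GiB" in the error above.
python3 -c 'print(128256 * 4096 * 4 / 2**30, "GiB")'   # -> ~1.96 GiB

The PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the message targets fragmentation and is unlikely to recover ~2 GiB here; the substantive fix is discussed below.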

Which GPU is running out of memory?

Which GPU is running out of memory?

Tesla T4

In your run script, modules_to_save="embed_tokens,lm_head" means these two parts are not trained with LoRA (they are trained in full precision).
Consider setting it to None and see whether training starts.
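
Concretely, the suggestion amounts to changing only these lines of the launch script (a minimal sketch; depending on the script version you may need to delete the --modules_to_save flag entirely rather than pass a literal None):

# Train LoRA adapters only; leave embed_tokens and lm_head frozen so that
# peft does not deep-copy them (the step that OOMs in the traceback above).
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"

# In the torchrun invocation, drop this line (or set the variable to None,
# if your run_clm_sft_with_peft.py accepts the literal string "None"):
#     --modules_to_save ${modules_to_save} \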

In your run script, modules_to_save="embed_tokens,lm_head" means these two parts are not trained with LoRA (they are trained in full precision). Consider setting it to None and see whether training starts.

I removed that parameter, but I still get the same error.

Did you restart the runtime? Make sure the GPU memory is cleared before running again.
Yesterday it ran fine on a Colab T4 (with modules_to_save=None), so please double-check on your side.
Alternatively, you can use any other tool that supports fine-tuning Llama-3.

It runs fine on AutoDL (an AI compute cloud platform) with a V100-32GB GPU.
After training, how do I convert the model to GGUF format?
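
On the GGUF question, the usual route is to merge the LoRA weights into the base model first, then convert and quantize with llama.cpp. A sketch under assumptions: the merge-script name below is inferred from this project's scripts/ directory, and llama.cpp tool names change between versions, so verify both against your checkouts.

# 1) Merge the trained LoRA adapter back into the base model
#    (script name assumed; check the scripts/ directory of this repo).
python scripts/merge_llama3_with_chinese_lora_low_mem.py \
    --base_model hfl/llama-3-chinese-8b-instruct-v2 \
    --lora_model output_dir \
    --output_dir merged_model

# 2) Convert the merged Hugging Face model to GGUF (fp16) with llama.cpp.
python convert-hf-to-gguf.py merged_model --outtype f16 --outfile model-f16.gguf

# 3) Optionally quantize; the binary is named quantize in mid-2024
#    llama.cpp builds (later renamed to llama-quantize).
./quantize model-f16.gguf model-q4_0.gguf q4_0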

Also, when the default system prompt DEFAULT_SYSTEM_PROMPT is changed to a longer one, training reports CUDA out of memory as well.

Can a Colab T4 handle --load_in_kbits 8, or will memory be insufficient?
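
A weights-only back-of-the-envelope estimate (ignoring activations, LoRA gradients, and optimizer state) suggests 8-bit roughly doubles the weight footprint relative to 4-bit, so a ~15 GiB T4 gets tight:

# Approximate weight-only memory for an 8B-parameter model:
python3 -c 'p = 8e9; print("4-bit: %.1f GiB, 8-bit: %.1f GiB" % (p*0.5/2**30, p*1.0/2**30))'
# -> 4-bit: 3.7 GiB, 8-bit: 7.5 GiB

Whether the remaining headroom suffices at max_seq_length=512 would need testing; 4-bit is the safer choice on a T4.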

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.