ymcui/Chinese-LLaMA-Alpaca-3

Multi-GPU training crashes with: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure

cc8476 opened this issue · 1 comment

Checklist before submitting

  • Make sure you are using the latest code from the repository (git pull).
  • I have read the FAQ section of the project documentation and searched the existing issues; no similar problem or solution was found.
  • Third-party plugin issues (e.g. llama.cpp, text-generation-webui): please look for a solution in the corresponding project first.

Issue type

Model training and fine-tuning

Base model

Others

Operating system

Linux

Describe the problem in detail

Running the multi-GPU training script
torchrun --nnodes 1 --nproc_per_node 2 run_clm_pt_with_peft.py
throws the error below (with nproc_per_node set to 1 it does not, and pre-training completes normally).
I spent a whole day searching online; everything pointed to environment checks, all of which looked fine, and I still can't find the cause...
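
To narrow down whether the failure comes from the training code or from the GPU-to-GPU communication path itself, a minimal all-reduce smoke test can be launched the same way. This is a hypothetical helper script (smoke_test.py) using only torch's public distributed API:

```python
# smoke_test.py -- minimal NCCL all-reduce across the same 2 GPUs.
# Launch exactly like the failing run:
#   torchrun --nnodes 1 --nproc_per_node 2 smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR/PORT for us
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x)  # sum across ranks; crashes or hangs here if comms are broken
    print(f"rank {dist.get_rank()}: all_reduce ok, value={x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this also dies with "unspecified launch failure", the problem is in the driver/CUDA/NCCL stack rather than in run_clm_pt_with_peft.py.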


My environment:
Model: llama3_8b (the official base model)
Hardware:
H800 *8
Software:
torch.__version__
2.3.1
torch.version.cuda
12.1

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

In addition, NCCL and nvidia-fabricmanager are both installed and running normally.
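
For completeness, what the PyTorch build itself reports can be checked through its public API (a small sketch; `python -m torch.utils.collect_env` prints a fuller environment report):

```python
import torch
import torch.distributed as dist

print(torch.__version__)           # 2.3.1
print(torch.version.cuda)          # CUDA runtime this torch build was compiled against
print(dist.is_nccl_available())    # True if NCCL support is compiled in
print(torch.cuda.nccl.version())   # bundled NCCL version tuple, e.g. (2, 20, 5)
```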

Dependencies (must be provided for code-related issues)

# Paste your dependency list here (inside this code block)

Runtime logs or screenshots


[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file tokenizer_config.json
[WARNING|logging.py:313] 2024-07-15 18:39:43,604 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
training datasets-wikipedia-cn-20230720-filtered has been loaded from disk
Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-2fbc8bad28044f1f.arrow
07/15/2024 18:39:43 - INFO - datasets.arrow_dataset - Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-2fbc8bad28044f1f.arrow
Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-940cea3e270b30f4.arrow
07/15/2024 18:39:43 - INFO - datasets.arrow_dataset - Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-940cea3e270b30f4.arrow
07/15/2024 18:39:44 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:313] 2024-07-15 18:39:44,945 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905979055/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2496f78897 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2496f28b25 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f249732f718 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1db46 (0x7f24972fab46 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7f24972fc5e3 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7f24972fc922 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a5950 (0x7f2495faf950 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7f2496f5d36f in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f2496f561cb in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f2496f56379 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0xe5d280 (0x7f244879d280 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x57692d2 (0x7f248e7cf2d2 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5773d00 (0x7f248e7d9d00 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5773e05 (0x7f248e7d9e05 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x4db0e26 (0x7f248de16e26 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x175be98 (0x7f248a7c1e98 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x577e1b4 (0x7f248e7e41b4 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x577ef65 (0x7f248e7e4f65 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd21ca8 (0x7f249672bca8 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47def4 (0x7f2495e87ef4 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #20: /root/miniconda3/envs/myenv/bin/python() [0x4fd4c7]
frame #21: _PyObject_MakeTpCall + 0x25b (0x4f6c5b in /root/miniconda3/envs/myenv/bin/python)
frame #22: /root/miniconda3/envs/myenv/bin/python() [0x5093cf]
frame #23: _PyEval_EvalFrameDefault + 0x13b3 (0x4eecf3 in /root/miniconda3/envs/myenv/bin/python)
frame #24: _PyFunction_Vectorcall + 0x6f (0x4fd90f in /root/miniconda3/envs/myenv/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x2b79 (0x4f04b9 in /root/miniconda3/envs/myenv/bin/python)
frame #26: _PyFunction_Vectorcall + 0x6f (0x4fd90f in /root/miniconda3/envs/myenv/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x4b26 (0x4f2466 in /root/miniconda3/envs/myenv/bin/python)
frame #28: /root/miniconda3/envs/myenv/bin/python() [0x5717c7]
frame #29: /root/miniconda3/envs/myenv/bin/python() [0x4fdaf4]
frame #30: _PyEval_EvalFrameDefault + 0x31f (0x4edc5f in /root/miniconda3/envs/myenv/bin/python)
frame #31: /root/miniconda3/envs/myenv/bin/python() [0x509367]
frame #32: _PyEval_EvalFrameDefault + 0x2818 (0x4f0158 in /root/miniconda3/envs/myenv/bin/python)
frame #33: _PyFunction_Vectorcall + 0x6f (0x4fd90f in /root/miniconda3/envs/myenv/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x31f (0x4edc5f in /root/miniconda3/envs/myenv/bin/python)
frame #35: /root/miniconda3/envs/myenv/bin/python() [0x595062]
frame #36: PyEval_EvalCode + 0x87 (0x594fa7 in /root/miniconda3/envs/myenv/bin/python)
frame #37: /root/miniconda3/envs/myenv/bin/python() [0x5c5e17]
frame #38: /root/miniconda3/envs/myenv/bin/python() [0x5c0f60]
frame #39: /root/miniconda3/envs/myenv/bin/python() [0x4595b6]
frame #40: _PyRun_SimpleFileObject + 0x19f (0x5bb4ef in /root/miniconda3/envs/myenv/bin/python)
frame #41: _PyRun_AnyFileObject + 0x43 (0x5bb253 in /root/miniconda3/envs/myenv/bin/python)
frame #42: Py_RunMain + 0x38d (0x5b800d in /root/miniconda3/envs/myenv/bin/python)
frame #43: Py_BytesMain + 0x39 (0x588299 in /root/miniconda3/envs/myenv/bin/python)
frame #44: <unknown function> + 0x29d90 (0x7f24b6591d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #45: __libc_start_main + 0x80 (0x7f24b6591e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: /root/miniconda3/envs/myenv/bin/python() [0x58814e]

W0715 18:39:49.066000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 13549 closing signal SIGTERM
W0715 18:40:19.066000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:868] Unable to shutdown process 13549 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
^CW0715 18:41:12.757000 139889817904960 torch/distributed/elastic/agent/server/api.py:741] Received Signals.SIGINT death signal, shutting down workers
W0715 18:41:12.757000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 13549 closing signal SIGINT
^CW0715 18:41:12.948000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 13549 closing signal SIGTERM
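
When the trace is this opaque, a common next step (a sketch, not something tried in this issue) is to rerun with more verbose CUDA/NCCL diagnostics enabled, either exported in the shell before torchrun or set at the very top of the training script before torch touches the GPU:

```python
# Debug switches (must be set before CUDA/NCCL initialize; names current as of torch 2.3)
import os

os.environ["NCCL_DEBUG"] = "INFO"         # per-rank NCCL logs (topology, transport, errors)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous launches, so the failing kernel appears in the trace
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"  # surface async collective failures promptly
```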

Upgrading from CUDA 12.1 to CUDA 12.2 fixed it...
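
After such an upgrade it is worth confirming that the driver and the CUDA runtime seen by PyTorch still agree (a quick hypothetical sanity check):

```python
import torch

print(torch.version.cuda)         # CUDA runtime the installed torch build targets
print(torch.cuda.is_available())  # False would indicate a driver/runtime mismatch
print(torch.cuda.device_count())  # should report all 8 H800s
```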