Tlntin/Qwen-TensorRT-LLM

Running run.py fails with Segmentation fault (core dumped)

Closed this issue · 8 comments

Basic environment:

torch==2.1.0
tensorrt_llm==0.7.0
transformers==4.38.2
accelerate==0.27.2

Run build.py:

python build.py --hf_model_dir ./tmp/Qwen1.5/14B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/Qwen1.5/14B/trt_engines/fp16/1-gpu/


The engine builds successfully, but running run.py fails with an error.

Command:

python run.py --input_text "你好,请问你叫什么?" \
                  --max_new_tokens=50 \
                  --tokenizer_dir ./tmp/Qwen1.5/14B/ \
                  --engine_dir=./tmp/Qwen1.5/14B/trt_engines/fp16/1-gpu/


Error message:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[deeplearning-use-1-tr034784-0:95188] *** Process received signal ***
[deeplearning-use-1-tr034784-0:95188] Signal: Segmentation fault (11)
[deeplearning-use-1-tr034784-0:95188] Signal code: Address not mapped (1)
[deeplearning-use-1-tr034784-0:95188] Failing at address: 0x440000e9
[deeplearning-use-1-tr034784-0:95188] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fcc2f8cf420]
[deeplearning-use-1-tr034784-0:95188] [ 1] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Comm_set_errhandler+0x47)[0x7fca05d9cfc7]
[deeplearning-use-1-tr034784-0:95188] [ 2] /home/powerop/work/conda/envs/qwen_tensorrt/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x9abf0)[0x7fc9e50a2bf0]
[deeplearning-use-1-tr034784-0:95188] [ 3] /home/powerop/work/conda/envs/qwen_tensorrt/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x2decf)[0x7fc9e5035ecf]
[deeplearning-use-1-tr034784-0:95188] [ 4] python(PyModule_ExecDef+0x70)[0x597be0]
[deeplearning-use-1-tr034784-0:95188] [ 5] python[0x598f69]
[deeplearning-use-1-tr034784-0:95188] [ 6] python[0x4fcf3b]
[deeplearning-use-1-tr034784-0:95188] [ 7] python(_PyEval_EvalFrameDefault+0x5a35)[0x4f3375]
[deeplearning-use-1-tr034784-0:95188] [ 8] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [ 9] python(_PyEval_EvalFrameDefault+0x4b26)[0x4f2466]
[deeplearning-use-1-tr034784-0:95188] [10] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [11] python(_PyEval_EvalFrameDefault+0x731)[0x4ee071]
[deeplearning-use-1-tr034784-0:95188] [12] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [13] python(_PyEval_EvalFrameDefault+0x31f)[0x4edc5f]
[deeplearning-use-1-tr034784-0:95188] [14] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [15] python(_PyEval_EvalFrameDefault+0x31f)[0x4edc5f]
[deeplearning-use-1-tr034784-0:95188] [16] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [17] python[0x4fd0d4]
[deeplearning-use-1-tr034784-0:95188] [18] python(_PyObject_CallMethodIdObjArgs+0x137)[0x50be37]
[deeplearning-use-1-tr034784-0:95188] [19] python(PyImport_ImportModuleLevelObject+0x525)[0x50b195]
[deeplearning-use-1-tr034784-0:95188] [20] python[0x516f44]
[deeplearning-use-1-tr034784-0:95188] [21] python[0x4fd4c7]
[deeplearning-use-1-tr034784-0:95188] [22] python(PyObject_Call+0x209)[0x509d69]
[deeplearning-use-1-tr034784-0:95188] [23] python(_PyEval_EvalFrameDefault+0x5a35)[0x4f3375]
[deeplearning-use-1-tr034784-0:95188] [24] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [25] python(_PyEval_EvalFrameDefault+0x31f)[0x4edc5f]
[deeplearning-use-1-tr034784-0:95188] [26] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [27] python[0x4fd0d4]
[deeplearning-use-1-tr034784-0:95188] [28] python(_PyObject_CallMethodIdObjArgs+0x137)[0x50be37]
[deeplearning-use-1-tr034784-0:95188] [29] python(PyImport_ImportModuleLevelObject+0x9da)[0x50b64a]
[deeplearning-use-1-tr034784-0:95188] *** End of error message ***
Segmentation fault (core dumped)

Could you help me figure out what the problem is? Thanks!

It is probably an environment issue; I'd recommend installing inside a container.
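
The backtrace shows the crash inside PMPI_Comm_set_errhandler in libmpi.so.40, reached while Python imports the mpi4py extension module, which is the usual signature of mpi4py being linked against a different MPI than the one loaded at runtime. A minimal way to check this outside of run.py (a sketch; the site-packages path is taken from the backtrace above and may differ on your machine):

# Reproduce in isolation: if MPI is the culprit, importing mpi4py directly
# should segfault the same way run.py does.
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"

# See which libmpi the mpi4py extension actually resolves to at load time.
ldd /home/powerop/work/conda/envs/qwen_tensorrt/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so | grep -i libmpi

# Compare with the MPI installation visible on PATH.
which mpirun && mpirun --version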

I see, but my cloud server is already a Docker environment. Could this be a package version issue? Does it have anything to do with the Triton version, etc.?

You can install with the official Triton container. By the way, which GPU are you using?

The GPU is an A100.

Oh, then that should be fine. Try again with the official Triton container.
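
For reference, starting from the official Triton image usually looks like the sketch below. Treat it as an illustration only: the <release>-trtllm-python-py3 tag shown is just an example of NVIDIA's tag convention and has to match the tensorrtllm_backend branch you pick (check the tensorrtllm_backend release notes / NGC for the exact pairing), and the mount path is a placeholder.

# Pull the Triton container that bundles TensorRT-LLM (tag is an example).
docker pull nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3

# Start an interactive shell with GPU access and the model directory mounted.
docker run --gpus all -it --rm \
    -v /path/to/Qwen1.5-14B-Chat:/models \
    nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 bash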

OK, thanks, I'll give it another try. Should I use the tensorrtllm_backend 0.5.0 branch with Triton 23.10, or the latest, tensorrtllm_backend 0.8.0 with Triton 24.01?
The model I'm trying to get running is qwen1.5-14b-chat.

[image]

Got it, thank you very much!