Tlntin/Qwen-TensorRT-LLM

Running run.py fails with Segmentation fault (core dumped)

Closed this issue · 8 comments

Basic environment:

torch==2.1.0
tensorrt_llm==0.7.0
transformers==4.38.2
accelerate==0.27.2

Run build.py:

python build.py --hf_model_dir ./tmp/Qwen1.5/14B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/Qwen1.5/14B/trt_engines/fp16/1-gpu/


The engine builds successfully, but running run.py fails with an error.

Command:

python run.py --input_text "你好,请问你叫什么?" \
                  --max_new_tokens=50 \
                  --tokenizer_dir ./tmp/Qwen1.5/14B/ \
                  --engine_dir=./tmp/Qwen1.5/14B/trt_engines/fp16/1-gpu/


Error message:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[deeplearning-use-1-tr034784-0:95188] *** Process received signal ***
[deeplearning-use-1-tr034784-0:95188] Signal: Segmentation fault (11)
[deeplearning-use-1-tr034784-0:95188] Signal code: Address not mapped (1)
[deeplearning-use-1-tr034784-0:95188] Failing at address: 0x440000e9
[deeplearning-use-1-tr034784-0:95188] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fcc2f8cf420]
[deeplearning-use-1-tr034784-0:95188] [ 1] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Comm_set_errhandler+0x47)[0x7fca05d9cfc7]
[deeplearning-use-1-tr034784-0:95188] [ 2] /home/powerop/work/conda/envs/qwen_tensorrt/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x9abf0)[0x7fc9e50a2bf0]
[deeplearning-use-1-tr034784-0:95188] [ 3] /home/powerop/work/conda/envs/qwen_tensorrt/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x2decf)[0x7fc9e5035ecf]
[deeplearning-use-1-tr034784-0:95188] [ 4] python(PyModule_ExecDef+0x70)[0x597be0]
[deeplearning-use-1-tr034784-0:95188] [ 5] python[0x598f69]
[deeplearning-use-1-tr034784-0:95188] [ 6] python[0x4fcf3b]
[deeplearning-use-1-tr034784-0:95188] [ 7] python(_PyEval_EvalFrameDefault+0x5a35)[0x4f3375]
[deeplearning-use-1-tr034784-0:95188] [ 8] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [ 9] python(_PyEval_EvalFrameDefault+0x4b26)[0x4f2466]
[deeplearning-use-1-tr034784-0:95188] [10] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [11] python(_PyEval_EvalFrameDefault+0x731)[0x4ee071]
[deeplearning-use-1-tr034784-0:95188] [12] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [13] python(_PyEval_EvalFrameDefault+0x31f)[0x4edc5f]
[deeplearning-use-1-tr034784-0:95188] [14] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [15] python(_PyEval_EvalFrameDefault+0x31f)[0x4edc5f]
[deeplearning-use-1-tr034784-0:95188] [16] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [17] python[0x4fd0d4]
[deeplearning-use-1-tr034784-0:95188] [18] python(_PyObject_CallMethodIdObjArgs+0x137)[0x50be37]
[deeplearning-use-1-tr034784-0:95188] [19] python(PyImport_ImportModuleLevelObject+0x525)[0x50b195]
[deeplearning-use-1-tr034784-0:95188] [20] python[0x516f44]
[deeplearning-use-1-tr034784-0:95188] [21] python[0x4fd4c7]
[deeplearning-use-1-tr034784-0:95188] [22] python(PyObject_Call+0x209)[0x509d69]
[deeplearning-use-1-tr034784-0:95188] [23] python(_PyEval_EvalFrameDefault+0x5a35)[0x4f3375]
[deeplearning-use-1-tr034784-0:95188] [24] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [25] python(_PyEval_EvalFrameDefault+0x31f)[0x4edc5f]
[deeplearning-use-1-tr034784-0:95188] [26] python(_PyFunction_Vectorcall+0x6f)[0x4fd90f]
[deeplearning-use-1-tr034784-0:95188] [27] python[0x4fd0d4]
[deeplearning-use-1-tr034784-0:95188] [28] python(_PyObject_CallMethodIdObjArgs+0x137)[0x50be37]
[deeplearning-use-1-tr034784-0:95188] [29] python(PyImport_ImportModuleLevelObject+0x9da)[0x50b64a]
[deeplearning-use-1-tr034784-0:95188] *** End of error message ***
Segmentation fault (core dumped)

Could you help me figure out what the problem is? Thanks!

It is probably an environment issue; I'd recommend installing inside a container.
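
The backtrace shows the crash inside PMPI_Comm_set_errhandler in libmpi.so.40, reached while Python imports the mpi4py extension module, which is the usual signature of mpi4py being linked against a different MPI than the one loaded at runtime. A minimal way to check this outside of run.py (a sketch; the site-packages path is taken from the backtrace above and may differ on your machine):

# Reproduce in isolation: if MPI is the culprit, importing mpi4py directly
# should segfault the same way run.py does.
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"

# See which libmpi the mpi4py extension actually resolves to at load time.
ldd /home/powerop/work/conda/envs/qwen_tensorrt/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so | grep -i libmpi

# Compare with the MPI installation visible on PATH.
which mpirun && mpirun --version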

I see, but my cloud server is already a Docker environment. Could this be a package version issue? Does it have anything to do with the Triton version, etc.?

You can install with the official Triton container. By the way, which GPU are you using?

The GPU is an A100.

Oh, then that should be fine. Try again with the official Triton container.
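
For reference, starting from the official Triton image usually looks like the sketch below. Treat it as an illustration only: the <release>-trtllm-python-py3 tag shown is just an example of NVIDIA's tag convention and has to match the tensorrtllm_backend branch you pick (check the tensorrtllm_backend release notes / NGC for the exact pairing), and the mount path is a placeholder.

# Pull the Triton container that bundles TensorRT-LLM (tag is an example).
docker pull nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3

# Start an interactive shell with GPU access and the model directory mounted.
docker run --gpus all -it --rm \
    -v /path/to/Qwen1.5-14B-Chat:/models \
    nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 bash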

OK, thanks, I'll give it another try. Should I use the tensorrtllm_backend 0.5.0 branch with Triton 23.10, or the latest, tensorrtllm_backend 0.8.0 with Triton 24.01?
The model I'm trying to get running is qwen1.5-14b-chat.

[image]

Got it, thank you very much!