Tlntin/Qwen-TensorRT-LLM

ERROR: Failed to create instance: unexpected error when creating modelInstanceState

lyc728 opened this issue · 3 comments

lyc728 commented

[MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5465, now: CPU 0, GPU 5465 (MiB)
E0131 07:18:59.626800 38258 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: When KV cache block reuse is set, model has to be built with paged context FMHA support (/home/jenkins/agent/workspace/LLM/release-0.7/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:140)
1 0x7fbea749d68d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1668d) [0x7fbea749d68d]
2 0x7fbea74a33de /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1c3de) [0x7fbea74a33de]
3 0x7fbea74f64c4 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6f4c4) [0x7fbea74f64c4]
4 0x7fbea74ed308 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66308) [0x7fbea74ed308]
5 0x7fbea74cee6c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x47e6c) [0x7fbea74cee6c]
6 0x7fbea74cff12 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48f12) [0x7fbea74cff12]
7 0x7fbea74bfd65 TRITONBACKEND_ModelInstanceInitialize + 101
8 0x7fbf1959aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7fbf1959aa86]
9 0x7fbf1959bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7fbf1959bcc6]
10 0x7fbf1957ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7fbf1957ec15]
11 0x7fbf1957f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7fbf1957f256]
12 0x7fbf1958b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7fbf1958b27d]
13 0x7fbf18bf9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fbf18bf9ee8]
14 0x7fbf1957597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7fbf1957597b]
15 0x7fbf19585695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7fbf19585695]
16 0x7fbf1958a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7fbf1958a50b]
17 0x7fbf19673610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7fbf19673610]
18 0x7fbf19676d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7fbf19676d03]
19 0x7fbf197c38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7fbf197c38b2]
20 0x7fbf18e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fbf18e64253]
21 0x7fbf18bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbf18bf4ac3]
22 0x7fbf18c85bf4 clone + 68

After I modified tensorrt_llm/config.pbtxt as below and then started the Triton server, it failed with the error above:
parameters: {
  key: "enable_trt_overlap"
  value: {
    # string_value: "${enable_trt_overlap}"
    string_value: "False"
  }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    # string_value: "${enable_kv_cache_reuse}"
    string_value: "True"
  }
}
The official docs say to set use_paged_context_fmha to true, but the build script does not have that parameter. Have you run into this problem? triton-inference-server/tensorrtllm_backend#271
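One way to sanity-check this, sketched below under the assumption that the engine directory contains the usual config.json (the path and the exact key names are placeholders and vary between TensorRT-LLM versions): inspect the plugin_config recorded at build time and look for anything related to paged context FMHA or paged KV cache.

```python
import json
from pathlib import Path

# Placeholder path; point this at the directory holding the serialized
# TensorRT-LLM engine (the same directory used as the engine/gpt_model_path).
engine_dir = Path("/path/to/qwen/trt_engines")

# TensorRT-LLM writes a config.json next to the engine; its plugin_config
# section records which plugin features were enabled at build time. Key names
# differ between versions, so just print everything and scan for entries
# related to paged KV cache / paged context FMHA.
config = json.loads((engine_dir / "config.json").read_text())
plugin_config = config.get("plugin_config", {})
for key, value in sorted(plugin_config.items()):
    print(f"{key}: {value}")
```

If nothing related to paged context FMHA shows up there, the assertion above is expected to fire whenever enable_kv_cache_reuse is set to True.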

Tlntin commented

This looks like it is caused by running out of GPU memory; try reducing the input and output lengths.

lyc728 commented

It is not a GPU memory problem. The only thing I changed is this parameter:
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    # string_value: "${enable_kv_cache_reuse}"
    string_value: "True"
  }
}
The documentation (https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.7.0/inflight_batcher_llm) says this parameter needs to be set for In-flight Batching; with it left as False the model loads normally.
I also have some questions about In-flight Batching, because in
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.7.0/examples/qwen
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.7.1/examples/qwen
the Support Matrix only lists:
FP16
INT8 & INT4 Weight-Only
SmoothQuant
INT8 KV CACHE
Tensor Parallel
STRONGLY TYPED
In-flight Batching is not listed there, and only the GPT model documentation explicitly says it is supported, so is it currently unsupported for Qwen? (When I serve the model with Triton and run multi-threaded inference, and compare it against running https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.0/examples/run.py directly, repeated experiments show no speedup at all.) Hoping you can clarify.
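On the speed comparison: in-flight batching mainly improves throughput when many requests are in flight at the same time, so sending one request at a time and comparing against run.py will look roughly the same. Below is a minimal sketch for putting concurrent load on the server, assuming a typical tensorrtllm_backend deployment; the "ensemble" model name, the /generate route, and the text_input / max_tokens / text_output fields are assumptions, so adjust them to your setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Triton's HTTP generate endpoint; the model name and field names below are
# assumptions about a typical tensorrtllm_backend ensemble deployment.
URL = "http://localhost:8000/v2/models/ensemble/generate"
PROMPTS = ["What is machine learning?"] * 32  # toy load of 32 identical requests


def infer(prompt: str) -> str:
    payload = {
        "text_input": prompt,
        "max_tokens": 128,
        "bad_words": "",
        "stop_words": "",
    }
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json().get("text_output", "")


start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(infer, PROMPTS))
print(f"{len(results)} requests finished in {time.time() - start:.1f}s")
```

Throughput under this kind of concurrent load, rather than single-request latency, is where in-flight batching is expected to make a visible difference.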

Please follow the issue template to share the details and reproduction steps. Also, please use English to describe your issue. Thank you for your cooperation.