How can regular batch inference be supported?

Question

zhangyu68 opened this issue 10 months ago · 2 comments

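I want to run regular batch inference. I deploy the model with Triton and run inference through TritonClient. When I pack the inputs into a single batch and send them in one call, I get the following error: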

File /opt/conda/envs/pai_H800/lib/python3.8/site-packages/tritonclient/http/_utils.py:69, in _raise_if_error(response)
67 error = _get_error(response)
68 if error is not None:
---> 69 raise error

InferenceServerException: [400] in ensemble 'ensemble', Inflight batching backend expects requests with batch size of 1.

Answer 1 · 2024-03-13T01:58:01.000Z

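Inflight batching only supports batch_size=1 per request, but you can send requests from multiple threads and let the server batch them together itself.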
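A minimal sketch of the multi-threaded approach, in case it helps. It assumes the default HTTP endpoint `localhost:8000` and the tensor names `text_input`, `max_tokens`, and `text_output`, which may differ in your `ensemble` config. The helper names `infer_one` and `infer_many` are made up for illustration. Each thread keeps its own client, because the tritonclient HTTP client is not documented as thread-safe.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # holds one Triton client per worker thread


def _get_client(url):
    # Imported lazily so the threading helper below works without tritonclient.
    import tritonclient.http as httpclient

    if not hasattr(_local, "client"):
        _local.client = httpclient.InferenceServerClient(url=url)
    return _local.client


def infer_one(prompt, url="localhost:8000", model_name="ensemble", max_tokens=64):
    """Send a single request with batch size 1 (hypothetical helper).

    The URL and the tensor names are assumptions; check your config.pbtxt.
    """
    import numpy as np
    import tritonclient.http as httpclient

    client = _get_client(url)
    text = np.array([[prompt.encode("utf-8")]], dtype=object)  # shape [1, 1]
    limit = np.array([[max_tokens]], dtype=np.int32)           # shape [1, 1]

    inputs = [
        httpclient.InferInput("text_input", list(text.shape), "BYTES"),
        httpclient.InferInput("max_tokens", list(limit.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(text)
    inputs[1].set_data_from_numpy(limit)

    result = client.infer(model_name, inputs)
    return result.as_numpy("text_output")


def infer_many(prompts, infer_fn=infer_one, max_workers=8):
    """Issue one request per prompt concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(infer_fn, prompts))
```

Calling `infer_many(["prompt A", "prompt B"])` sends two independent batch-size-1 requests at the same time, and the inflight batching scheduler is then free to run them together on the server side.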

If you want to try sending multiple samples in a single request, change gpt_model_type to "V1", and also turn off the paged-kv-cache feature when building the engine.

Answer 2 · 2024-03-14T06:17:51.000Z

Thanks for the answer!