How can normal batch inference be supported?
zhangyu68 opened this issue · 2 comments
zhangyu68 commented
I want to use normal batch inference. I deploy the model with Triton and run inference through TritonClient. When I pack multiple inputs into a single batch and call it directly, I get the following error:
File /opt/conda/envs/pai_H800/lib/python3.8/site-packages/tritonclient/http/_utils.py:69, in _raise_if_error(response)
67 error = _get_error(response)
68 if error is not None:
---> 69 raise error
InferenceServerException: [400] in ensemble 'ensemble', Inflight batching backend expects requests with batch size of 1.
Tlntin commented
In-flight batching only supports batch_size=1 per request, but you can send requests concurrently from multiple threads and let the server batch them together on its own; a sketch of that approach is shown below.
If you want a single request with batch_size > 1, change gpt_model_type to "V1" and also disable the paged KV cache when building the engine.
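A minimal sketch of the multi-threaded approach, not taken from this thread: each request keeps batch size 1 (as the in-flight batching backend expects) and concurrency comes from a thread pool. The endpoint URL and the tensor names/dtypes ("text_input", "max_tokens", "text_output") are assumptions based on the default tensorrtllm_backend ensemble; adjust them to match your deployment's config.pbtxt.

```python
# Sketch: concurrent batch-size-1 requests to the "ensemble" model via Triton HTTP.
# Tensor names, shapes, and dtypes are assumptions; check your ensemble config.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"   # assumed Triton HTTP endpoint
MODEL = "ensemble"       # model name from the error message


def infer_one(prompt: str) -> str:
    # One client per thread; each call carries a single prompt (batch size 1).
    client = httpclient.InferenceServerClient(url=URL)

    text_input = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_input.set_data_from_numpy(np.array([[prompt.encode()]], dtype=object))

    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

    result = client.infer(MODEL, inputs=[text_input, max_tokens])
    return result.as_numpy("text_output").flatten()[0].decode()


prompts = ["Hello", "What is Triton?", "Explain in-flight batching."]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    outputs = list(pool.map(infer_one, prompts))

for p, o in zip(prompts, outputs):
    print(p, "->", o)
```

The server's in-flight batching then interleaves these single-sequence requests on its own, so client-side padding into one large batch is not needed.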
zhangyu68 commented
Thanks for the reply!