microsoft/MInference

[Question]: The speed of minference in short context

maxin9966 opened this issue · 1 comment

Describe the issue

For short sequences, does MInference reduce inference speed or affect vLLM's throughput?

Hi @maxin9966, thank you for your question. Because of the overhead of building the approximate/sparse indices, latency can be slightly higher than full attention when the context is shorter than roughly 10K tokens. You can find the latency benchmark details at minference-benchmark-experiments. You can decide whether to enable MInference based on the context size.
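
As a rough illustration of "enable based on context size", here is a minimal sketch for the HuggingFace path. It assumes the patching pattern from the repo README (`MInference(attn_type, model_name)` returning a callable that patches the model); the model name and the 10K-token threshold are placeholders, not recommendations.

```python
# Sketch: only apply the MInference patch when the prompt is long enough
# for sparse attention to pay off. The MInference(...) call and the
# "minference" attn_type follow the repo README; treat them as assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

MODEL_NAME = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model
THRESHOLD_TOKENS = 10_000  # below this, full attention is typically faster

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def prepare_model(model, prompt: str):
    """Patch the model with MInference only if the prompt exceeds the threshold."""
    n_tokens = len(tokenizer(prompt).input_ids)
    if n_tokens > THRESHOLD_TOKENS:
        patch = MInference("minference", MODEL_NAME)
        model = patch(model)  # swaps in MInference's sparse attention
    return model
```

The same idea applies to the vLLM integration: route short prompts to an unpatched engine and long prompts to the MInference-patched one.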