microsoft/MInference

[Question]: The speed of minference in short context

maxin9966 opened this issue · 1 comment

Describe the issue

For short sequences, does MInference reduce inference speed or affect vLLM's throughput?

Hi @maxin9966, thank you for your question. Because of the overhead of building the approximate/sparse indices, latency can be slightly higher than full attention when the context is shorter than roughly 10K tokens. You can find the latency benchmark details at minference-benchmark-experiments. You can decide whether to enable MInference based on the context size.
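
As a rough illustration of "enable based on context size", here is a minimal sketch for the HuggingFace path. It assumes the patching pattern from the repo README (`MInference(attn_type, model_name)` returning a callable that patches the model); the model name and the 10K-token threshold are placeholders, not recommendations.

```python
# Sketch: only apply the MInference patch when the prompt is long enough
# for sparse attention to pay off. The MInference(...) call and the
# "minference" attn_type follow the repo README; treat them as assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

MODEL_NAME = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model
THRESHOLD_TOKENS = 10_000  # below this, full attention is typically faster

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def prepare_model(model, prompt: str):
    """Patch the model with MInference only if the prompt exceeds the threshold."""
    n_tokens = len(tokenizer(prompt).input_ids)
    if n_tokens > THRESHOLD_TOKENS:
        patch = MInference("minference", MODEL_NAME)
        model = patch(model)  # swaps in MInference's sparse attention
    return model
```

The same idea applies to the vLLM integration: route short prompts to an unpatched engine and long prompts to the MInference-patched one.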