Add batched inference
abetlen opened this issue · 37 comments
- Use `llama_decode` instead of deprecated `llama_eval` in `Llama` class
- Implement batched inference support for `generate` and `create_completion` methods in `Llama` class
- Add support for streaming / infinite completion
Silly question, does that also include support for parallel decoding in llama.cpp?
Does the newest version support llama.cpp's "batched decoding"?
This would be a huge improvement for production use.
I tested locally with 4 parallel requests to the built-in ./server binary in llama.cpp and am able to hit some insanely good tokens/sec -- multiple times faster than what we get with a single request via the non-batched inference.
@LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example, I'm trying to do the same. Thanks!
There are 2 new flags in llama.cpp to add to your normal command: `-cb -np 4` (`-cb` = continuous batching, `-np` = parallel request count).
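For anyone who wants to reproduce the comparison from Python, a minimal client that fires the 4 requests concurrently could look like this (it assumes the server is running on the default localhost:8080 and exposes the /completion endpoint; adjust to your setup):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumes: ./server -m model.gguf -cb -np 4 is already running on localhost:8080.
SERVER_URL = "http://localhost:8080/completion"

def complete(prompt: str) -> str:
    payload = json.dumps({"prompt": prompt, "n_predict": 64}).encode()
    req = urllib.request.Request(
        SERVER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

prompts = [f"Write a haiku about the number {i}." for i in range(4)]

# With -np 4 and continuous batching the server can decode these concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(complete, prompts):
        print(result)
```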
Thanks, that works for me with llama.cpp, but not llama-cpp-python, which I think is expected. Unfortunately, the server API in llama.cpp doesn't seem to be as good as the server in llama-cpp-python, at least for my task. Using the same llama model, I get better results with llama-cpp-python. So, I hope this can be added soon!
When will this feature be available? I hope someone can help solve this problem, please.
Let me know if there are any roadblocks - I might be able to provide some insight
Hey @ggerganov I missed this earlier.
Thank you, yeah I just need some quick clarifications around the kv cache behaviour.
The following is my understanding of the kv_cache implementation:

- The kv cache starts with a number of free cells initially equal to `n_ctx`
- If the number of free cells gets down to 0, the kv cache / available context is full and some cells must be cleared to process any more tokens
- When calling `llama_decode`, `batch.n_tokens` can only be as large as the largest free slot; if `n_tokens` is too large (`llama_decode` returns > 1) you reduce the batch size and retry
- The number of occupied cells increases by `batch.n_tokens` on every call to `llama_decode`
- The number of free cells increases when an occupied cell no longer belongs to any sequence or is shifted to `pos < 0`
- Calling `llama_kv_cache_seq_cp` does not cause any additional free cells to be occupied; the copy is "shallow" and only adds the new sequence id to the set
- Calling `llama_kv_cache_shift` works by modifying the kv cells that belong to a given sequence, however this also shifts the cell in all of the other sequences it belongs to
Is this correct?
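To make sure I'm not misreading those rules, here is a toy model of the cell accounting in plain Python -- explicitly not the real API, and it glosses over the fact that the real cache needs a contiguous free slot:

```python
# Toy model of the KV cache bookkeeping described above -- plain Python,
# not the real llama.cpp API; the "largest contiguous free slot" detail
# is simplified to a total free-cell count.

class ToyKVCache:
    def __init__(self, n_ctx: int):
        # each cell is either None (free) or the set of sequence ids it belongs to
        self.cells = [None] * n_ctx

    def free_cells(self) -> int:
        return sum(cell is None for cell in self.cells)

    def decode(self, n_tokens: int, seq_id: int) -> int:
        # mirrors the retry rule: a positive return means "no room, shrink
        # the batch and retry"; success occupies n_tokens cells for seq_id
        if n_tokens > self.free_cells():
            return 1
        placed = 0
        for i, cell in enumerate(self.cells):
            if cell is None and placed < n_tokens:
                self.cells[i] = {seq_id}
                placed += 1
        return 0

    def seq_cp(self, src: int, dst: int) -> None:
        # "shallow" copy: no new cells are occupied, existing cells gain a seq id
        for cell in self.cells:
            if cell is not None and src in cell:
                cell.add(dst)

    def seq_rm(self, seq_id: int) -> None:
        # a cell becomes free again once it belongs to no sequence at all
        for i, cell in enumerate(self.cells):
            if cell is not None and seq_id in cell:
                cell.discard(seq_id)
                if not cell:
                    self.cells[i] = None


kv = ToyKVCache(n_ctx=8)
assert kv.decode(4, seq_id=0) == 0   # prompt occupies 4 of 8 cells
kv.seq_cp(0, 1)                      # shared with seq 1: still 4 free cells
assert kv.free_cells() == 4
assert kv.decode(8, seq_id=1) == 1   # too big: caller must shrink and retry
kv.seq_rm(0); kv.seq_rm(1)           # cells freed once no sequence owns them
assert kv.free_cells() == 8
```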
Yes, all of this is correct.
> Calling llama_kv_cache_shift works by modifying the kv cells that belong to a given sequence however this also shifts this cell in all of the other sequences it belongs to
This call also sets a flag so that upon the next `llama_decode`, the computation will first shift the KV cache data before proceeding as usual.
Will soon add a couple of functions to the API that can be useful for monitoring the KV cache state:
One of the main applications of `llama_kv_cache_seq_cp` is to "share" a common prompt (i.e. same tokens at the same positions) across multiple sequences. The most trivial example is a system prompt which is at the start of all generated sequences. By sharing it, the KV cache will be reused and thus less memory will be consumed, instead of having a copy for each sequence.
I updated the version and saw the batch configuration. But when I ran it, batching didn't take effect. When I send multiple requests, it still handles them one by one. My startup configuration is as follows:
python3 -m llama_cpp.server --model ./models/WizardLM-13B-V1.2/ggml-model-f16-Q5.gguf --n_gpu_layers 2 --n_ctx 8000 --n_batch 512 --n_threads 10 --n_threads_batch 10 --interrupt_requests False
Is there something wrong with my configuration? @abetlen
@zpzheng It's a draft PR so it's not complete - you can see "Add support for parallel requests" is in the todo list
+1, would be really great to have this
+1, would be so great to have this!
+1
+1
+1
+1
Guys, any other solution for this??
+1
+1
+1
I use llama-cpp-python as (presently) the sole first-class-support backend in a project I've been developing. I'm not sure how much it would benefit from batching, as I've yet to do performance testing against other backends, but I feel like it could be a significant boon.
What's the current status of this and #951? I might be interested in taking a look at this, but I'm not certain I'd bring much to the table, I'll have to review the related code more.
> I use llama-cpp-python as (presently) the sole first-class-support backend in a project I've been developing.
I would not do this. Batching is super important, and I had to move to llama.cpp's server (easy to deploy w/ Docker or Python, or even just the exe) because of the lack of features in llama-cpp-python. If you're doing CPU inference, llama.cpp is a great option; otherwise I would use something like vLLM, BentoML's OpenLLM, or Predibase's LoRAx.
> I would not do this. Batching is super important and I had to move to llama.cpp's server
This is something I was considering, appreciate the advice. I'll likely end up doing that. I had to do the same with Ollama, but I wasn't on Ollama long and by no means felt it was the right fit for the job, support for it merely started from a peer showing interest and my compulsion to explore all viable options where possible.
I'm doing GPU inference and sadly that means Nvidia's antics have hindered me from getting things running in a container just the way I'd like them to up until now... but that's another story. I haven't tried vLLM, OpenLLM or LoRAx, llama.cpp and llama-cpp-python have generally been all I've needed up till now (and for longer, I hope-- I really appreciate the work done by all contributors to both projects, exciting that we're at least where we are today). Are those libraries any good if you're looking to do something with the perplexity of say q6_k on a (VRAM) budget? I'd prefer to be able to run it on my 1080Ti, even when I have access to more VRAM in another environment.
I am dealing with this right now -- and unfortunately llama-cpp-server has the only completions endpoint that I can find that supports logprobs properly (random hard requirement I have). llama.cpp doesn't natively support it in its API, which is why I use llama-cpp-python.
vllm is great for big GPUs, however, it doesn't support gguf-quantized models or running at full-precision for smaller GPUs (I use T4s).
Whereas ideally I could run llama-cpp-python on an instance with 4 T4s and batch across them (like vllm can do), I am now creating instances with 1 gpu and scaling horizontally.
If anyone knows of a better openai-compatible API endpoint that wraps llama.cpp, I am listening, but I haven't found one.
Have you tried https://github.com/ollama/ollama?
Ollama doesn't support batched inference, what a silly suggestion.
I case this is useful to others, as a workaround until this is implemented, I wrote a tiny python library that
- downloads and installs the raw llama.cpp server binary
- downloads some model weights from huggingface hub
- provides a simple
Server
class to control starting/stopping the binary
This was needed because the raw server binary supports batched inference (and supports structured output, another requirement I have). All the heavy logic is already in the upstream C server, so all I needed to do was do the CLI and subprocess logic.