Batching and vLLM-style KV caching missing
michaelfeil opened this issue · 7 comments
Hi @michaelfeil, I do plan on adding these features. This project is still in active development, and those features are not implemented yet.
Please feel free to contribute these features! The official vllm project is very large, and I would appreciate any contributions!
How is this project related to vllm? Why the name? What does vllm mean?
Here is the repo https://github.com/vllm-project/vllm and some theoretical background from the authors: https://arxiv.org/pdf/2309.06180.pdf
@michaelfeil Yes, I know about vllm. I'm confused by the name candle-vllm. I wonder whether candle-vllm is going to replicate vllm in Rust completely, or to build a more general-purpose inference platform in Rust / candle beyond vllm.
Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server.
@liebkne, candle-vllm seeks to replicate most of vllm in Rust. Of course, essentials such as the OpenAI-compatible API server will be implemented first, followed by sampling techniques and PagedAttention/KV cache.
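
For context, "OpenAI compatible" just means the server accepts requests shaped like the OpenAI chat completions API. A minimal client sketch, assuming the server mirrors the standard `/v1/chat/completions` route on `localhost:8000` (the port and model name here are illustrative, not candle-vllm's actual defaults):

```rust
// Hypothetical client call against an OpenAI-compatible endpoint.
// Requires reqwest with the "blocking" and "json" features, plus serde_json.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let resp: serde_json::Value = client
        .post("http://localhost:8000/v1/chat/completions")
        .json(&json!({
            "model": "llama-2-7b",                                  // illustrative model name
            "messages": [{ "role": "user", "content": "Hello!" }],
            "max_tokens": 64
        }))
        .send()?
        .json()?;
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```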
@michaelfeil, @liebkne: Please see the paged_attention branch, where the PagedAttention mechanism is now being developed!
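
For anyone following along, the core idea being ported is vLLM's block-based KV cache: each sequence's keys/values live in fixed-size blocks handed out from a shared pool, so memory is allocated on demand instead of being reserved up front for the maximum sequence length. Below is a minimal, illustrative Rust sketch of that bookkeeping; the type and field names are placeholders, not the actual code on the paged_attention branch.

```rust
const BLOCK_SIZE: usize = 16; // tokens per KV-cache block (illustrative)

/// Hands out indices into a preallocated pool of KV-cache blocks.
struct BlockAllocator {
    free_blocks: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free_blocks: (0..num_blocks).collect() }
    }
    fn allocate(&mut self) -> Option<usize> {
        self.free_blocks.pop()
    }
    fn free(&mut self, block: usize) {
        // Returning a block to the pool lets finished sequences release memory.
        self.free_blocks.push(block);
    }
}

/// Per-sequence block table: maps logical token positions to physical blocks.
struct SequenceBlocks {
    blocks: Vec<usize>,
    num_tokens: usize,
}

impl SequenceBlocks {
    fn new() -> Self {
        Self { blocks: Vec::new(), num_tokens: 0 }
    }
    /// Reserve space for one more token, grabbing a new block only when the
    /// current one is full. Returns None if the pool is exhausted.
    fn append_token(&mut self, alloc: &mut BlockAllocator) -> Option<()> {
        if self.num_tokens % BLOCK_SIZE == 0 {
            self.blocks.push(alloc.allocate()?);
        }
        self.num_tokens += 1;
        Some(())
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(1024);
    let mut seq = SequenceBlocks::new();
    for _ in 0..40 {
        seq.append_token(&mut alloc).expect("out of KV blocks");
    }
    // 40 tokens at 16 tokens/block -> 3 physical blocks used.
    assert_eq!(seq.blocks.len(), 3);
}
```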
Closing this issue to prevent staleness - please feel free to reopen. See #14.