EricLBuehler/candle-vllm

Batching and VLLM-style kv caching missing

michaelfeil opened this issue · 7 comments

Your implementation is looking great so far.

I got a bit confused: given the name vllm, I would have expected two features to be implemented: batching and vLLM-style KV caching.

Is there a plan to support them?
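To sketch what I mean by the batching part (purely illustrative; none of these types or names come from candle-vllm), the engine would re-form its batch on every decode step, admitting waiting requests and retiring finished sequences:

```rust
// Purely illustrative continuous-batching loop; not candle-vllm's actual scheduler.
use std::collections::VecDeque;

struct Sequence {
    tokens: Vec<u32>,
    finished: bool,
}

struct Engine {
    waiting: VecDeque<Sequence>, // requests that have arrived but are not yet scheduled
    running: Vec<Sequence>,      // sequences decoded together this step
    max_batch: usize,
}

impl Engine {
    /// One decode step: admit new requests, decode every running sequence,
    /// then retire finished ones so their slots free up for the next step.
    fn step(&mut self) {
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(seq) => self.running.push(seq),
                None => break,
            }
        }
        for seq in &mut self.running {
            let next_token = 0u32; // placeholder: model forward pass + sampling would go here
            seq.tokens.push(next_token);
            seq.finished = seq.tokens.len() >= 32; // placeholder stop condition
        }
        self.running.retain(|s| !s.finished);
    }
}
```

The point is simply that requests join and leave the batch on every step rather than waiting for a whole batch to finish.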

Hi @michaelfeil, I do plan on adding these features. This project is still in active development, and those features are not implemented yet.

Please feel free to contribute these features! The official vllm project is very large, and I would appreciate any contributions!

How is this project related to vllm? Why the name? What does vllm mean?

Here is the repo, https://github.com/vllm-project/vllm, and some theoretical background from the authors: https://arxiv.org/pdf/2309.06180.pdf

@michaelfeil Yes, I know about vllm. I'm confused by the name candle-vllm: is candle-vllm going to replicate vllm in Rust completely, or to build a more general-purpose inference platform in Rust/candle that goes beyond vllm?

Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server.

@liebkne, candle-vllm seeks to replicate most of vllm in Rust. Of course, essentials such as the OpenAI API server will be implemented first, followed by sampling techniques and PagedAttention/kv-cache.
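To give a rough idea of what the PagedAttention/kv-cache part involves (a hypothetical sketch only, not candle-vllm's actual implementation): the KV cache is carved into fixed-size physical blocks, and each sequence owns a block table mapping its logical blocks to physical ones, so cache memory is allocated on demand instead of being reserved for the maximum sequence length up front.

```rust
// Hypothetical block-table sketch of vllm-style paged KV caching; not candle-vllm's actual code.
use std::collections::HashMap;

const BLOCK_SIZE: usize = 16; // tokens stored per physical KV-cache block

struct BlockAllocator {
    free_blocks: Vec<usize>,                // indices of unused physical blocks
    block_tables: HashMap<u64, Vec<usize>>, // sequence id -> its physical block indices
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self {
            free_blocks: (0..num_blocks).collect(),
            block_tables: HashMap::new(),
        }
    }

    /// Make sure `seq_id` owns enough physical blocks to hold `num_tokens` of KV cache,
    /// allocating new blocks only when its current ones are full.
    fn reserve(&mut self, seq_id: u64, num_tokens: usize) -> Result<(), &'static str> {
        let needed = (num_tokens + BLOCK_SIZE - 1) / BLOCK_SIZE;
        let table = self.block_tables.entry(seq_id).or_default();
        while table.len() < needed {
            let block = self.free_blocks.pop().ok_or("out of KV-cache blocks")?;
            table.push(block);
        }
        Ok(())
    }

    /// Return a finished sequence's blocks to the free pool for reuse by other requests.
    fn free(&mut self, seq_id: u64) {
        if let Some(table) = self.block_tables.remove(&seq_id) {
            self.free_blocks.extend(table);
        }
    }
}
```

The attention kernel then gathers keys and values through the block table, which is what lets many sequences share one pool of cache memory.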

@michaelfeil, @liebkne: Please see the paged_attention branch, where the PagedAttention mechanism is now being developed!

Closing this issue to prevent staleness - please feel free to reopen. See #14.