EricLBuehler/candle-vllm

Batching and VLLM-style kv caching missing

michaelfeil opened this issue · 7 comments

Your implementation is looking great so far.

I got a bit confused: given the name vllm, I would have expected two features to be implemented: batching and vLLM-style KV caching.

Is there a plan to support them?
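To sketch what I mean by the batching part (purely illustrative; none of these types or names come from candle-vllm), the engine would re-form its batch on every decode step, admitting waiting requests and retiring finished sequences:

```rust
// Purely illustrative continuous-batching loop; not candle-vllm's actual scheduler.
use std::collections::VecDeque;

struct Sequence {
    tokens: Vec<u32>,
    finished: bool,
}

struct Engine {
    waiting: VecDeque<Sequence>, // requests that have arrived but are not yet scheduled
    running: Vec<Sequence>,      // sequences decoded together this step
    max_batch: usize,
}

impl Engine {
    /// One decode step: admit new requests, decode every running sequence,
    /// then retire finished ones so their slots free up for the next step.
    fn step(&mut self) {
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(seq) => self.running.push(seq),
                None => break,
            }
        }
        for seq in &mut self.running {
            let next_token = 0u32; // placeholder: model forward pass + sampling would go here
            seq.tokens.push(next_token);
            seq.finished = seq.tokens.len() >= 32; // placeholder stop condition
        }
        self.running.retain(|s| !s.finished);
    }
}
```

The point is simply that requests join and leave the batch on every step rather than waiting for a whole batch to finish.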

Hi @michaelfeil, I do plan on adding these features. This project is still in active development, and those features are not implemented yet.

Please feel free to contribute these features! The official vllm project is very large, and I would appreciate any contributions!

How is this project related to vllm? Why the name? What does vllm mean?

Here is the repo, https://github.com/vllm-project/vllm, and some theoretical background from the authors: https://arxiv.org/pdf/2309.06180.pdf

@michaelfeil Yes, I know about vllm. I'm confused by the name candle-vllm: is candle-vllm going to replicate vllm in Rust completely, or to build a more general-purpose inference platform in Rust/candle that goes beyond vllm?

Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server.

@liebkne, candle-vllm seeks to replicate most of vllm in Rust. Of course, essentials such as the OpenAI API server will be implemented first, followed by sampling techniques and PagedAttention/kv-cache.
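To give a rough idea of what the PagedAttention/kv-cache part involves (a hypothetical sketch only, not candle-vllm's actual implementation): the KV cache is carved into fixed-size physical blocks, and each sequence owns a block table mapping its logical blocks to physical ones, so cache memory is allocated on demand instead of being reserved for the maximum sequence length up front.

```rust
// Hypothetical block-table sketch of vllm-style paged KV caching; not candle-vllm's actual code.
use std::collections::HashMap;

const BLOCK_SIZE: usize = 16; // tokens stored per physical KV-cache block

struct BlockAllocator {
    free_blocks: Vec<usize>,                // indices of unused physical blocks
    block_tables: HashMap<u64, Vec<usize>>, // sequence id -> its physical block indices
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self {
            free_blocks: (0..num_blocks).collect(),
            block_tables: HashMap::new(),
        }
    }

    /// Make sure `seq_id` owns enough physical blocks to hold `num_tokens` of KV cache,
    /// allocating new blocks only when its current ones are full.
    fn reserve(&mut self, seq_id: u64, num_tokens: usize) -> Result<(), &'static str> {
        let needed = (num_tokens + BLOCK_SIZE - 1) / BLOCK_SIZE;
        let table = self.block_tables.entry(seq_id).or_default();
        while table.len() < needed {
            let block = self.free_blocks.pop().ok_or("out of KV-cache blocks")?;
            table.push(block);
        }
        Ok(())
    }

    /// Return a finished sequence's blocks to the free pool for reuse by other requests.
    fn free(&mut self, seq_id: u64) {
        if let Some(table) = self.block_tables.remove(&seq_id) {
            self.free_blocks.extend(table);
        }
    }
}
```

The attention kernel then gathers keys and values through the block table, which is what lets many sequences share one pool of cache memory.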

@michaelfeil, @liebkne: Please see the paged_attention branch, where the PagedAttention mechanism is now being developed!

Closing this issue to prevent staleness - please feel free to reopen. See #14.