
candle-vllm


An efficient platform for inference and serving of local LLMs, including an OpenAI-compatible API server.

Features

  • OpenAI-compatible API server for serving LLMs (see the request sketch below).
  • Highly extensible, trait-based system for rapid implementation of new module pipelines.
  • Streaming support during generation.
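
As a quick orientation, below is a minimal sketch of sending a chat completion request to a locally running candle-vllm server from Rust, using the `reqwest` (with the `blocking` and `json` features) and `serde_json` crates. The URL, port, model name, and request fields here are assumptions based on the OpenAI API format rather than confirmed defaults; adjust them to match your own server configuration.

```rust
// Minimal sketch: query a locally served LLM through the OpenAI-compatible API.
// The endpoint, port, and model name are assumptions; change them as needed.
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    let body = json!({
        "model": "llama",                      // placeholder model name
        "messages": [
            { "role": "user", "content": "Explain attention in one sentence." }
        ],
        "max_tokens": 128
    });

    // Assumed OpenAI-style chat completions endpoint on the local server.
    let resp: Value = client
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    // Print the first returned choice.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```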

Overview

One of the goals of candle-vllm is to interface with locally served LLMs through an OpenAI-compatible API server.

  1. During initial setup, the model, tokenizer, and other parameters are loaded.

  2. When a request is received:

    • Sampling parameters are extracted, including n, the number of choices to generate.
    • The request is converted to a prompt, which is sent to the model pipeline.
    • If a streaming request is received, token-by-token streaming using SSEs is established (n choices of 1 token each); see the streaming sketch after this list.
    • Otherwise, the n choices are generated in full and returned.
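
To illustrate the streaming path, here is a hedged sketch that requests token-by-token output over SSE and prints the deltas as they arrive. The endpoint, port, model name, and chunk shape follow the OpenAI streaming convention and are assumptions, not confirmed candle-vllm defaults.

```rust
// Minimal streaming sketch: request SSE output and print tokens as they arrive.
// Endpoint, port, model name, and response shape are assumptions based on the
// OpenAI streaming format; adjust them for your setup.
use std::io::{BufRead, BufReader, Write};
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    let body = json!({
        "model": "llama",                  // placeholder model name
        "messages": [{ "role": "user", "content": "Tell me a short story." }],
        "n": 1,                            // number of choices to generate
        "stream": true                     // ask for server-sent events
    });

    let resp = client
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?;

    // Each SSE line looks like `data: {...}`; the stream ends with `data: [DONE]`.
    for line in BufReader::new(resp).lines() {
        let line = line?;
        if let Some(payload) = line.strip_prefix("data: ") {
            if payload == "[DONE]" {
                break;
            }
            let chunk: Value = serde_json::from_str(payload)?;
            if let Some(token) = chunk["choices"][0]["delta"]["content"].as_str() {
                print!("{token}");
                std::io::stdout().flush()?;
            }
        }
    }
    println!();
    Ok(())
}
```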

Contributing

The following features are planned but not yet implemented; contributions toward them are especially welcome:

  • Sampling methods
  • Pipeline batching (#3)
  • KV cache (#3)
  • PagedAttention (#3)
  • More pipelines (from candle-transformers)

Resources