The Easiest Rust Interface for Local LLMs

# For macOS (CPU and GPU), Windows (CPU and CUDA), or Linux (CPU and CUDA)
llm_client="*"

This will download and build llama.cpp. See build.md for other features and backends like mistral.rs.

use llm_client::prelude::*;
// Loads the largest quant available based on your VRAM or system memory
let llm_client = LlmClient::llama_cpp()
    .mistral7b_instruct_v0_3() // Uses a preset model
    .init() // Downloads the model from Hugging Face and starts the inference interface
    .await?;

Several of the most common models are available as presets. Loading from local models is also fully supported. See models.md for more information.
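The comment in the snippet above says the largest quant that fits in VRAM or system memory is loaded. As a rough illustration of that idea only, here is a minimal sketch; `Quant`, `pick_largest_quant`, and the `reserved_bytes` parameter are hypothetical names invented for this example and are not part of the llm_client API.

```rust
/// A quantized variant of a model. Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Quant {
    /// Bits per weight of this quantization level.
    pub bits: u8,
    /// Size of the model file in bytes.
    pub size_bytes: u64,
}

/// Returns the largest quant that fits into `available_bytes` after
/// setting aside `reserved_bytes` (an assumed allowance for context
/// and runtime overhead). Returns `None` if nothing fits.
pub fn pick_largest_quant(
    quants: &[Quant],
    available_bytes: u64,
    reserved_bytes: u64,
) -> Option<&Quant> {
    // saturating_sub avoids underflow when the reservation exceeds memory.
    let budget = available_bytes.saturating_sub(reserved_bytes);
    quants
        .iter()
        .filter(|q| q.size_bytes <= budget)
        .max_by_key(|q| q.size_bytes)
}
```

With 8 GiB available and 2 GiB reserved, a list of 3.0 GB, 4.4 GB, and 7.7 GB quants would yield the 4.4 GB one. How the crate actually estimates memory is documented in models.md, not here.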

Features

Automated build and support for CPU, CUDA, and macOS
Easy model presets and quant selection
Novel cascading prompt workflow for CoT and NLP workflows. DIY workflow creation supported!
Breadth of configuration options (sampler params, retry logic, prompt caching, logit bias, grammars, etc.)
API support for OpenAI, Anthropic, Perplexity, and any OpenAI compatible API

An Interface for Deterministic Signals from Probabilistic LLM Vibes

Beyond basic LLM inference, llm_client is designed primarily for controlled generation using step-based cascade workflows. This prompting system runs pre-defined workflows that control and constrain both the overall structure of generation and individual tokens during inference. This makes it possible to implement specialized workflows for specific tasks, shaping LLM outputs toward intended, reproducible outcomes.

let response: u32 = llm_client.reason().integer()
    .instructions()
    .set_content("Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?")
    .return_primitive().await?;

// Receive 'primitive' outputs
assert_eq!(response, 1);

This runs the reason workflow, a one-round cascading prompt workflow, with an integer output.
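To picture what constraining individual tokens means for an integer output, here is a minimal sketch under stated assumptions: the candidate tokens and scores are made up, and `is_allowed` and `constrained_pick` are hypothetical helpers, not llm_client functions. The crate itself relies on backend features such as grammars and logit bias for this.

```rust
/// Returns true if `token` keeps the output a plain unsigned integer,
/// i.e. it is non-empty and consists only of ASCII digits.
pub fn is_allowed(token: &str) -> bool {
    !token.is_empty() && token.bytes().all(|b| b.is_ascii_digit())
}

/// Picks the highest-scoring candidate among those the constraint allows.
/// `candidates` pairs each token with a score (higher is more likely).
pub fn constrained_pick<'a>(candidates: &[(&'a str, f32)]) -> Option<&'a str> {
    candidates
        .iter()
        .filter(|(tok, _)| is_allowed(tok))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(tok, _)| *tok)
}
```

Even if the model scores the word "one" higher than the digit "1", the filter removes it, so the final step can only produce text that parses as an integer.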

This method significantly improves the reliability of LLM use cases. For example, there are test cases in this repo that can be used to benchmark an LLM. There is a large increase in accuracy when comparing basic inference with a constrained outcome and a CoT-style cascading prompt workflow. The decision workflow, which runs N CoT workflows across a temperature gradient, approaches 100% accuracy on the test cases.
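The decision idea can be sketched as follows. This is an illustration only: it assumes evenly spaced temperatures and a simple majority vote, and `temperature_gradient`, `decide`, and the `reason` closure are hypothetical stand-ins for running the reason workflow once at a given temperature.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Evenly spaced temperatures from `start` to `end`, inclusive.
pub fn temperature_gradient(start: f32, end: f32, n: usize) -> Vec<f32> {
    match n {
        0 => Vec::new(),
        1 => vec![start],
        _ => {
            let step = (end - start) / (n as f32 - 1.0);
            (0..n).map(|i| start + step * i as f32).collect()
        }
    }
}

/// Calls `reason` once per temperature and returns the most common
/// answer together with its vote count. Runs that return `None`
/// (no usable answer) are skipped. Ties go to the smaller answer
/// so the result is deterministic.
pub fn decide<F>(temperatures: &[f32], mut reason: F) -> Option<(u32, usize)>
where
    F: FnMut(f32) -> Option<u32>,
{
    let mut votes: HashMap<u32, usize> = HashMap::new();
    for &t in temperatures {
        if let Some(answer) = reason(t) {
            *votes.entry(answer).or_insert(0) += 1;
        }
    }
    votes
        .into_iter()
        .max_by_key(|&(answer, count)| (count, Reverse(answer)))
}
```

Voting across several temperatures means a single off-track sample at a high temperature is outvoted by the other runs, which is one plausible reading of why the aggregate is more reliable than any single run.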

I have a full breakdown of this in my blog post, "Step-Based Cascading Prompts: Deterministic Signals from the LLM Vibe Space."

Jump to the readme.md of the llm_client crate to find out how to use these workflows.

Examples

device config - customizing your inference config
basic completion - the most basic request available
basic primitive - returns the request primitive
reason - a cascade workflow that performs CoT reasoning before returning a primitive
decision - uses the reason workflow N times across a temperature gradient
extract urls - a cascade workflow that extracts all URLs from text that meet a predicate

Docs

llm_client readme.md
docs directory

Guides

Limiting power in Nvidia GPUs

Blog Posts

Step-Based Cascading Prompts: Deterministic Signals from the LLM Vibe Space

Roadmap

Make the cascading workflow API easier to use.
Refactor the benchmarks module for easy model comparison.
WebUI client for local consumption.
Server mode for "LLM-in-a-box" deployments.
Full Rust inference via mistral.rs or candle.

Dependencies

llm_utils is a sibling crate that was split from llm_client. If you just need prompting, tokenization, model loading, etc., I suggest using the llm_utils crate on its own.
llm_interface is a sub-crate of llm_client. It is the backend for LLM inference.
llm_devices is a sub-crate of llm_client. It contains device and build management behavior.
llama.cpp is used in server mode for LLM inference as the current default.
mistral.rs is available for basic use, but is a WIP.

Contact

Shelby Jenkins - Here or LinkedIn

ShelbyJenkins/llm_client