/deepseek-ocr.rs

Rust implementation of DeepSeek-OCR with OpenAI-compatible server & CLI No Python environment needed - just download and run.

Primary LanguageRustApache License 2.0Apache-2.0

deepseek-ocr.rs πŸš€

Rust implementation of the DeepSeek-OCR inference stack with a fast CLI and an OpenAI-compatible HTTP server. The workspace packages the vision-language model, prompt tooling, and serving layer so you can build document understanding pipelines that run locally on CPU, Apple Metal, or (alpha) NVIDIA CUDA GPUs.

δΈ­ζ–‡ζ–‡ζ‘£θ―·ηœ‹ README_CN.md。

Want ready-made binaries? Latest macOS (Metal-enabled) and Windows bundles live in the build-binaries workflow artifacts. Grab them from the newest green run.

Inside crates/core πŸ”¬

  • Vision preprocessing – prepare_vision_input_from_image builds a square global canvas with letterboxing (build_global_view) and, when crop mode is enabled, applies dynamic_preprocess tiling to produce high-resolution local crops plus optional thumbnails.
  • SAM + CLIP fusion – each view is normalised via image_to_tensor, pushed through the Candle ports of SAM (SamBackbone) and CLIP-L (ClipVisionModel), then flattened with build_clip_sam_tokens so the features stay spatially aligned.
  • Projector & layout tokens – the custom ImageProjector linearly maps concatenated SAM/CLIP channels into the language hidden size while injecting learned image_newline/view_separator tokens to preserve grid structure, yielding the multimodal embeddings used during decoding.
  • Tokenizer alignment – build_prompt_tokens synthesises <image> spans whose length exactly matches the projected token count (global + local grids), ensuring OpenAI-style prompts remain consistent even after chat history pruning.
  • Decoder & caching – the text stack is a Candle reimplementation of DeepSeek-V2 (DeepseekLanguageModel) with optional FlashAttention, rotary position embeddings, and DynamicCache guards so both the CLI and server can stream tokens efficiently.
  • Observability & parity – debug builds expose CLIP/SAM traces (VisionDebugFeatures) so we can diff intermediate tensors against the PyTorch reference; most stages are already numerically aligned, and the few remaining deltas (mainly projector normalisation + vision tiling) are tracked on the roadmap for upcoming releases.

Why Rust? πŸ’‘

The original DeepSeek-OCR ships as a Python + Transformers stackβ€”powerful, but hefty to deploy and awkward to embed. Rewriting the pipeline in Rust gives us:

  • Smaller deployable artifacts with zero Python runtime or conda baggage.
  • Memory-safe, thread-friendly infrastructure that blends into native Rust backends.
  • Unified tooling (CLI + server) running on Candle + Rocket without the Python GIL overhead.
  • Drop-in compatibility with OpenAI-style clients while tuned for single-turn OCR prompts.

Technical Stack βš™οΈ

  • Candle for tensor compute, with Metal and CUDA backends and FlashAttention support.
  • Rocket + async streaming for OpenAI-compatible /v1/responses and /v1/chat/completions.
  • tokenizers (upstream DeepSeek release) wrapped by crates/assets for deterministic caching via Hugging Face and ModelScope mirrors.
  • Pure Rust vision/prompt pipeline shared by CLI and server to avoid duplicated logic.

Advantages over the Python Release πŸ₯·

  • Faster cold-start on Apple Silicon, lower RSS, and native binary distribution.
  • Deterministic dual-source (Hugging Face + ModelScope) asset download + verification built into the workspace.
  • Automatic single-turn chat compaction so OCR outputs stay stable even when clients send history.
  • Ready-to-use OpenAI compatibility for tools like Open WebUI without adapters.

Highlights ✨

  • One repo, two entrypoints – a batteries-included CLI for batch jobs and a Rocket-based server that speaks /v1/responses and /v1/chat/completions.
  • Works out of the box – pulls model weights, configs, and tokenizer from whichever of Hugging Face or ModelScope responds fastest on first run.
  • Optimised for Apple Silicon – optional Metal backend with FP16 execution for real-time OCR on laptops.
  • CUDA (alpha) – experimental support via --features cuda + --device cuda --dtype f16; expect rough edges while we finish kernel coverage.
  • Intel MKL (preview) – faster BLAS on x86 via --features mkl (install Intel oneMKL beforehand).
  • OpenAI client compatibility – drop-in replacement for popular SDKs; the server automatically collapses chat history to the latest user turn for OCR-friendly prompts.

Quick Start 🏁

Prerequisites

  • Rust 1.78+ (edition 2024 support)
  • Git
  • Optional: Apple Silicon running macOS 13+ for Metal acceleration
  • Optional: CUDA 12.2+ toolkit + driver for experimental NVIDIA GPU acceleration on Linux/Windows
  • Optional: Intel oneAPI MKL for preview x86 acceleration (see below)
  • (Recommended) Hugging Face account with HF_TOKEN when pulling from the deepseek-ai/DeepSeek-OCR repo (ModelScope is used automatically when it’s faster/reachable).

Clone the Workspace

git clone https://github.com/TimmyOVO/deepseek-ocr.rs.git
cd deepseek-ocr.rs
cargo fetch

Model Assets

The first invocation of the CLI or server downloads the config, tokenizer, and model-00001-of-000001.safetensors (~6.3GB) into DeepSeek-OCR/. To prefetch manually:

cargo run -p deepseek-ocr-cli --release -- --help # dev profile is extremely slow; always prefer --release

Always include --release when running from source; debug builds on this model are extremely slow. Set HF_HOME/HF_TOKEN if you store Hugging Face caches elsewhere (ModelScope downloads land alongside the same asset tree). The full model package is ~6.3GB on disk and typically requires ~13GB of RAM headroom during inference (model + activations).

Configuration & Overrides πŸ—‚οΈ

The CLI and server share the same configuration. On first launch we create a config.toml populated with defaults; later runs reuse it so both entrypoints stay in sync.

Platform Config file (default) Model cache root
Linux ~/.config/deepseek-ocr/config.toml ~/.cache/deepseek-ocr/models/<id>/…
macOS ~/Library/Application Support/deepseek-ocr/config.toml ~/Library/Caches/deepseek-ocr/models/<id>/…
Windows %APPDATA%\deepseek-ocr\config.toml %LOCALAPPDATA%\deepseek-ocr\models\<id>\…
  • Override the location with --config /path/to/config.toml (available on both CLI and server). Missing files are created automatically.
  • Each [models.entries."<id>"] record can point to custom config, tokenizer, or weights files. When omitted we fall back to the cache directory above and download/update assets as required.
  • Runtime values resolve in this order: command-line flags β†’ values stored in config.toml β†’ built-in defaults. The HTTP API adds a final layer where request payload fields (for example max_tokens) override everything else for that call.

The generated file starts with the defaults below; adjust them to persistently change behaviour:

[models]
active = "deepseek-ocr"

[models.entries.deepseek-ocr]

[inference]
device = "cpu"
template = "plain"
base_size = 1024
image_size = 640
crop_mode = true
max_new_tokens = 512
use_cache = true

[server]
host = "0.0.0.0"
port = 8000
model_id = "deepseek-ocr"
  • [models] picks the active model and lets you add more entries (each entry can point to its own config/tokenizer/weights).
  • [inference] controls notebook-friendly defaults shared by the CLI and server (device, template, vision sizing, decoding budget, cache usage).
  • [server] sets the network binding and the model identifier reported by /v1/models.

See crates/cli/README.md and crates/server/README.md for concise override tables.

Benchmark Snapshot πŸ“Š

Single-request Rust CLI (Accelerate backend on macOS) compared with the reference Python pipeline on the same prompt and image:

Stage ref total (ms) ref avg (ms) python total python/ref
Decode – Overall (decode.generate) 30077.840 30077.840 56554.873 1.88x
Decode – Token Loop (decode.iterative) 26930.216 26930.216 39227.974 1.46x
Decode – Prompt Prefill (decode.prefill) 3147.337 3147.337 5759.684 1.83x
Prompt – Build Tokens (prompt.build_tokens) 0.466 0.466 45.434 97.42x
Prompt – Render Template (prompt.render) 0.005 0.005 0.019 3.52x
Vision – Embed Images (vision.compute_embeddings) 6391.435 6391.435 3953.459 0.62x
Vision – Prepare Inputs (vision.prepare_inputs) 62.524 62.524 45.438 0.73x

Command-Line Interface πŸ–₯️

Build and run directly from the workspace:

cargo run -p deepseek-ocr-cli --release -- \
  --prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
  --image baselines/sample/images/test.png \
  --device cpu --max-new-tokens 512

Tip: --release is required for reasonable throughput; debug builds can be 10x slower.

macOS tip: append --features metal to the cargo run/cargo build commands to compile with Accelerate + Metal backends.

CUDA tip (Linux/Windows): append --features cuda and run with --device cuda --dtype f16 to target NVIDIA GPUsβ€”feature is still alpha, so be ready for quirks.

Intel MKL preview: install Intel oneMKL, then build with --features mkl for faster CPU matmuls on x86.

Install the CLI as a binary:

cargo install --path crates/cli
deepseek-ocr-cli --help

Key flags:

  • --prompt / --prompt-file: text with <image> slots
  • --image: path(s) matching <image> placeholders
  • --device and --dtype: choose metal + f16 on Apple Silicon or cuda + f16 on NVIDIA GPUs
  • --max-new-tokens: decoding budget

HTTP Server ☁️

Launch an OpenAI-compatible endpoint:

cargo run -p deepseek-ocr-server --release -- \
  --host 0.0.0.0 --port 8000 \
  --device cpu --max-new-tokens 512

Keep --release on the server as well; the debug profile is far too slow for inference workloads. macOS tip: add --features metal to the cargo run -p deepseek-ocr-server command when you want the server binary to link against Accelerate + Metal (and pair it with --device metal at runtime).

CUDA tip: add --features cuda and start the server with --device cuda --dtype f16 to offload inference to NVIDIA GPUs (alpha-quality support).

Intel MKL preview: install Intel oneMKL before building with --features mkl to accelerate CPU workloads on x86.

Notes:

  • Use data: URLs or remote http(s) links; local paths are rejected.
  • The server collapses multi-turn chat inputs to the latest user message to keep prompts OCR-friendly.
  • Works out of the box with tools such as Open WebUI or any OpenAI-compatible clientβ€”just point the base URL to your server (http://localhost:8000/v1) and select the deepseek-ocr model.
  • Adjust the request body limit with Rocket config if you routinely send large images.

Open WebUI connected to deepseek-ocr.rs

GPU Acceleration ⚑

  • Metal (macOS 13+ Apple Silicon) – pass --device metal --dtype f16 and build binaries with --features metal so Candle links against Accelerate + Metal.
  • CUDA (alpha, NVIDIA GPUs) – install CUDA 12.2+ toolkits, build with --features cuda, and launch the CLI/server with --device cuda --dtype f16; still experimental.
  • Intel MKL (preview) – install Intel oneMKL and build with --features mkl to speed up CPU workloads on x86.
  • For either backend, prefer release builds (e.g. cargo build --release -p deepseek-ocr-cli --features metal|cuda) to maximise throughput.
  • Combine GPU runs with --max-new-tokens and crop tuning flags to balance latency vs. quality.

Repository Layout πŸ—‚οΈ

  • crates/core – shared inference pipeline, model loaders, conversation templates.
  • crates/cli – command-line frontend (deepseek-ocr-cli).
  • crates/server – Rocket server exposing OpenAI-compatible endpoints.
  • crates/assets – asset management (configuration, tokenizer, Hugging Face + ModelScope download helpers).
  • baselines/ – reference inputs and outputs for regression testing.

Detailed CLI usage lives in crates/cli/README.md. The server’s OpenAI-compatible interface is covered in crates/server/README.md.

Troubleshooting πŸ› οΈ

  • Where do assets come from? – downloads automatically pick between Hugging Face and ModelScope based on latency; the CLI prints the chosen source for each file.
  • Slow first response – model load and GPU warm-up (Metal/CUDA alpha) happen on the initial request; later runs are faster.
  • Large image rejection – increase Rocket JSON limits in crates/server/src/main.rs or downscale the input.

Roadmap πŸ—ΊοΈ

  • βœ… Apple Metal backend with FP16 support and CLI/server parity on macOS.
  • βœ… NVIDIA CUDA backend (alpha) – build with --features cuda, run with --device cuda --dtype f16 for Linux/Windows GPUs; polishing in progress.
  • πŸ”„ Parity polish – finish projector normalisation + crop tiling alignment; extend intermediate-tensor diff suite beyond the current sample baseline.
  • πŸ”„ Grounding & streaming – port the Python post-processing helpers (box extraction, markdown polish) and refine SSE streaming ergonomics.
  • πŸ”„ Cross-platform acceleration – continue tuning CUDA kernels, add automatic device detection across CPU/Metal/CUDA, and publish opt-in GPU benchmarks.
  • πŸ”„ Packaging & Ops – ship binary releases with deterministic asset checksums, richer logging/metrics, and Helm/docker references for server deploys.
  • πŸ”œ Structured outputs – optional JSON schema tools for downstream automation once parity gaps close.

License πŸ“„

This repository inherits the licenses of its dependencies and the upstream DeepSeek-OCR model. Refer to DeepSeek-OCR/LICENSE for model terms and apply the same restrictions to downstream use.