This container runs exo with tinygrad as the inference engine, optimized for CPU-only operation. It provides an OpenAI-compatible API for running large language models on CPU hardware.
```bash
# Build the container
docker build -t exo-cpu:latest .

# Run with default settings (using host network for UDP discovery)
docker run --net=host \
  -v /path/to/models:/data/exo/downloads \
  -v /path/to/config:/data/exo/config \
  -v /path/to/temp:/data/exo/temp \
  exo-cpu:latest

# Or with Podman
podman build -t exo-cpu:latest .
podman run --net=host \
  -v /path/to/models:/data/exo/downloads \
  -v /path/to/config:/data/exo/config \
  -v /path/to/temp:/data/exo/temp \
  exo-cpu:latest
```

- CPU-only operation with tinygrad optimization
- OpenAI-compatible API endpoint (streaming and non-streaming)
- No external dependencies (self-contained)
- Works in disconnected environments
- Compatible with Docker and Podman
- Rootless operation with configurable UID/GID
- Built-in healthcheck (see the status check after this list)
- API-focused (Web UI disabled by default)
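The built-in healthcheck surfaces through the container runtime. Assuming you started the container with `--name exo-cpu`, you can query its status like this:

```bash
# Query the status reported by the image's built-in healthcheck
docker inspect --format='{{.State.Health.Status}}' exo-cpu
```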
| Path | Description | Required | Purpose |
|---|---|---|---|
| `/data/exo/downloads` | Model storage location | Yes | Store model files that persist between container updates |
| `/data/exo/config` | Configuration files | No | Store customized config for exo |
| `/data/exo/temp` | Temporary files | No | Cache and temporary operations data |
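Before mounting, you may want to create the host-side directories and align their ownership with the container's default UID/GID; the `/srv/exo` base path below is just an example:

```bash
# Create host directories for the bind mounts and match the default PUID/PGID
mkdir -p /srv/exo/{downloads,config,temp}
chown -R 1000:1000 /srv/exo
```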
| Variable | Description | Default |
|---|---|---|
| `EXO_HOME` | Base directory for exo data | `/data/exo` |
| `TINYGRAD_DEBUG` | Debug level for tinygrad (0-7) | `0` |
| `DEBUG` | Debug level for exo (0-9) | `0` |
| `CPU` | Force CPU mode in tinygrad | `1` |
| `CLANG` | Use Clang compiler for tinygrad | `1` |
| `OMP_NUM_THREADS` | Number of OpenMP threads | `4` |
| `MKL_NUM_THREADS` | Number of MKL threads | `4` |
| `OPENBLAS_NUM_THREADS` | Number of OpenBLAS threads | `4` |
| `VECLIB_MAXIMUM_THREADS` | Number of VecLib threads (Apple) | `4` |
| `DISABLE_COMPILER_CACHE` | Disable tinygrad compiler cache | `0` |
| `PUID` | User ID to run as (for rootless) | `1000` |
| `PGID` | Group ID to run as (for rootless) | `1000` |
| `NODE_ID` | Custom node ID | Auto-generated |
| `HF_ENDPOINT` | Alternative Hugging Face endpoint | `huggingface.co` |
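As one example, `HF_ENDPOINT` lets you point model downloads at an alternative Hugging Face endpoint; the mirror URL below is illustrative:

```bash
docker run --net=host \
  -e HF_ENDPOINT=https://hf-mirror.com \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```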
This container is optimized for CPU-only operation:
- Uses Clang compiler for best performance on modern CPUs
- Includes optimized math libraries (OpenBLAS, BLAS, LAPACK)
- Controls thread allocation for optimal performance via environment variables
- Explicitly disables all GPU backends (CUDA, ROCm, Metal, AMD, NV)
- Verifies tinygrad installation during build
- Confirms CLANG and CPU environment variables are properly set (see the check after this list)
- Pre-configured for high-performance CPU inference
- Enables tinygrad compiler cache for better performance
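To confirm the CPU-only settings are baked into the image, a quick check might look like this (image tag assumed from the build step above):

```bash
# Both CPU and CLANG should print with a value of 1
docker run --rm exo-cpu:latest env | grep -E '^(CPU|CLANG)='
```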
By default, the container uses `--net=host` so that UDP discovery works properly:
```bash
docker run --net=host -v /path/to/models:/data/exo/downloads exo-cpu:latest
```

For environments where `--net=host` is not possible, you can expose the following ports and use manual discovery:
```bash
docker run \
  -p 52415:52415 \
  -p 5678:5678/udp \
  -v /path/to/models:/data/exo/downloads \
  -v /path/to/config:/data/exo/config \
  exo-cpu:latest \
  python -m exo.main --inference-engine tinygrad --discovery-module manual \
  --discovery-config-path /data/exo/config/discovery.json
```

You'll need to create a manual discovery config file at `/data/exo/config/discovery.json` with content like:
```json
{
  "nodes": [
    {
      "id": "node1",
      "address": "192.168.1.10",
      "port": 5678
    },
    {
      "id": "node2",
      "address": "192.168.1.11",
      "port": 5678
    }
  ]
}
```

| Model Type | Endpoint | Examples |
|---|---|---|
| LLM | `/v1/chat/completions` | LLaMA-3.2-3B, LLaMA-3.1-8B, Phi-3, Qwen |
| Vision | `/v1/chat/completions` | LLaVA-1.5-7B (multimodal) |
| Text Generation | `/v1/completions` | Any LLM model supported by tinygrad |
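The usage examples later in this document target `/v1/chat/completions`; a `/v1/completions` request follows the same OpenAI convention. The model name and fields below are illustrative:

```bash
curl http://localhost:52415/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "prompt": "Distributed inference is",
    "max_tokens": 64
  }'
```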
The container supports running as a non-root user with configurable UID/GID:
```bash
docker run --net=host \
  -e PUID=1001 -e PGID=1001 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

This is especially important for security or when running in environments that enforce rootless containers. The entrypoint script uses `gosu` to drop privileges and run as the specified user.
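For context, the privilege drop typically follows the pattern sketched below; this is a minimal, hypothetical illustration, not the container's actual entrypoint script:

```bash
#!/bin/sh
# Hypothetical sketch of a gosu-based entrypoint: remap the runtime user, then drop root.
PUID=${PUID:-1000}
PGID=${PGID:-1000}
groupmod -o -g "$PGID" exo   # align the container group with the requested GID
usermod -o -u "$PUID" exo    # align the container user with the requested UID
chown -R exo:exo /data/exo   # make the data directories writable by that user
exec gosu exo "$@"           # re-exec the container command as the unprivileged user
```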
The container runs without the web UI by default, focusing only on the API. To enable the web UI:
```bash
docker run --net=host \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest \
  python -m exo.main --inference-engine tinygrad
```

A streaming chat completion request looks like this:

```bash
curl http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role": "user", "content": "What is distributed inference?"}],
"temperature": 0.7,
"stream": true
}'
```

The equivalent non-streaming request:

```bash
curl http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role": "user", "content": "What is distributed inference?"}],
"temperature": 0.7
}'
```

To optimize performance for your specific CPU, you can adjust the thread-count environment variables:
```bash
docker run --net=host \
  -e OMP_NUM_THREADS=8 \
  -e MKL_NUM_THREADS=8 \
  -e OPENBLAS_NUM_THREADS=8 \
  -e VECLIB_MAXIMUM_THREADS=8 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

Setting these values to match your CPU core count can improve performance. For best results:
- Set thread counts to match the number of available physical cores (see the snippet after this list for finding your core count)
- On machines with hyperthreading, test both the physical and logical core counts
- For large models, keep `DISABLE_COMPILER_CACHE=0` so the JIT compiler cache stays enabled
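On Linux, the core counts referenced in the list above can be read with `nproc` and `lscpu`:

```bash
nproc                                     # logical core count
lscpu | grep -E '^(Socket|Core|Thread)'   # sockets, cores per socket, threads per core
```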
tinygrad uses a just-in-time (JIT) compiler for CPU operations. By default, the container enables the compiler cache to improve performance. To disable it:
```bash
docker run --net=host \
  -e DISABLE_COMPILER_CACHE=1 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

If you're running in a Tailscale network, you can use the Tailscale discovery module:
```bash
docker run --net=host \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest \
  python -m exo.main --inference-engine tinygrad --discovery-module tailscale \
  --tailscale-api-key YOUR_API_KEY --tailnet-name YOUR_TAILNET
```

To specify a custom node ID for identification in the cluster:
```bash
docker run --net=host \
  -e NODE_ID=my-custom-node \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

To enable more verbose debugging, set the appropriate environment variables:
```bash
docker run --net=host \
  -e DEBUG=5 -e TINYGRAD_DEBUG=2 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

- `DEBUG` values (0-9): higher values show more exo debugging info
- `TINYGRAD_DEBUG` values (0-7): higher values show more tinygrad detail, including generated code:
  - `1`: lists devices being used
  - `2`: provides performance metrics for operations
  - `3`: outputs buffers used for each kernel
  - `4`: outputs the generated kernel code
  - `5`+: shows more detailed internal information
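Debug output goes to the container's stdout/stderr, so you can follow it with your runtime's log command (container name assumed):

```bash
docker logs -f exo-cpu
```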
Build the Docker image with:
```bash
docker build -t exo-cpu .
```

Or with Podman:
```bash
podman build -t exo-cpu .
```

The build process:
- Uses Python 3.12 slim Debian as the base image
- Installs necessary build dependencies and libraries
- Clones the exo repository
- Installs exo and its dependencies including tinygrad
- Sets up rootless user support (configurable via PUID/PGID)
- Configures the volumes and entry point for container execution
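After building, a quick sanity check that tinygrad imports cleanly inside the image (this mirrors the build-time verification mentioned above):

```bash
docker run --rm exo-cpu python -c "import tinygrad; print('tinygrad OK')"
```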