This container runs exo with tinygrad as the inference engine, optimized for CPU-only operation. It provides an OpenAI-compatible API for running large language models on CPU hardware.
```bash
# Build the container
docker build -t exo-cpu:latest .

# Run with default settings (using host network for UDP discovery)
docker run --net=host \
  -v /path/to/models:/data/exo/downloads \
  -v /path/to/config:/data/exo/config \
  -v /path/to/temp:/data/exo/temp \
  exo-cpu:latest

# Or with Podman
podman build -t exo-cpu:latest .
podman run --net=host \
  -v /path/to/models:/data/exo/downloads \
  -v /path/to/config:/data/exo/config \
  -v /path/to/temp:/data/exo/temp \
  exo-cpu:latest
```

- CPU-only operation with tinygrad optimization
- OpenAI-compatible API endpoint (streaming and non-streaming)
- No external dependencies (self-contained)
- Works in disconnected environments
- Compatible with Docker and Podman
- Rootless operation with configurable UID/GID
- Built-in healthcheck (see the status check after this list)
- API-focused (Web UI disabled by default)
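The built-in healthcheck surfaces through the container runtime. Assuming you started the container with `--name exo-cpu`, you can query its status like this:

```bash
# Query the status reported by the image's built-in healthcheck
docker inspect --format='{{.State.Health.Status}}' exo-cpu
```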
| Path | Description | Required | Purpose |
|---|---|---|---|
| `/data/exo/downloads` | Model storage location | Yes | Store model files that persist between container updates |
| `/data/exo/config` | Configuration files | No | Store customized config for exo |
| `/data/exo/temp` | Temporary files | No | Cache and temporary operations data |
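Before mounting, you may want to create the host-side directories and align their ownership with the container's default UID/GID; the `/srv/exo` base path below is just an example:

```bash
# Create host directories for the bind mounts and match the default PUID/PGID
mkdir -p /srv/exo/{downloads,config,temp}
chown -R 1000:1000 /srv/exo
```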
| Variable | Description | Default |
|---|---|---|
| `EXO_HOME` | Base directory for exo data | `/data/exo` |
| `TINYGRAD_DEBUG` | Debug level for tinygrad (0-7) | `0` |
| `DEBUG` | Debug level for exo (0-9) | `0` |
| `CPU` | Force CPU mode in tinygrad | `1` |
| `CLANG` | Use Clang compiler for tinygrad | `1` |
| `OMP_NUM_THREADS` | Number of OpenMP threads | `4` |
| `MKL_NUM_THREADS` | Number of MKL threads | `4` |
| `OPENBLAS_NUM_THREADS` | Number of OpenBLAS threads | `4` |
| `VECLIB_MAXIMUM_THREADS` | Number of VecLib threads (Apple) | `4` |
| `DISABLE_COMPILER_CACHE` | Disable tinygrad compiler cache | `0` |
| `PUID` | User ID to run as (for rootless) | `1000` |
| `PGID` | Group ID to run as (for rootless) | `1000` |
| `NODE_ID` | Custom node ID | Auto-generated |
| `HF_ENDPOINT` | Alternative Hugging Face endpoint | `huggingface.co` |
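As one example, `HF_ENDPOINT` lets you point model downloads at an alternative Hugging Face endpoint; the mirror URL below is illustrative:

```bash
docker run --net=host \
  -e HF_ENDPOINT=https://hf-mirror.com \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```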
This container is optimized for CPU-only operation:
- Uses Clang compiler for best performance on modern CPUs
- Includes optimized math libraries (OpenBLAS, BLAS, LAPACK)
- Controls thread allocation for optimal performance via environment variables
- Explicitly disables all GPU backends (CUDA, ROCm, Metal, AMD, NV)
- Verifies tinygrad installation during build
- Confirms CLANG and CPU environment variables are properly set (see the check after this list)
- Pre-configured for high-performance CPU inference
- Enables tinygrad compiler cache for better performance
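To confirm the CPU-only settings are baked into the image, a quick check might look like this (image tag assumed from the build step above):

```bash
# Both CPU and CLANG should print with a value of 1
docker run --rm exo-cpu:latest env | grep -E '^(CPU|CLANG)='
```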
By default, the container uses `--net=host` so that UDP discovery works properly:
```bash
docker run --net=host -v /path/to/models:/data/exo/downloads exo-cpu:latest
```

For environments where `--net=host` is not possible, you can expose the following ports and use manual discovery:
```bash
docker run \
  -p 52415:52415 \
  -p 5678:5678/udp \
  -v /path/to/models:/data/exo/downloads \
  -v /path/to/config:/data/exo/config \
  exo-cpu:latest \
  python -m exo.main --inference-engine tinygrad --discovery-module manual \
  --discovery-config-path /data/exo/config/discovery.json
```

You'll need to create a manual discovery config file at `/data/exo/config/discovery.json` with content like:
```json
{
  "nodes": [
    {
      "id": "node1",
      "address": "192.168.1.10",
      "port": 5678
    },
    {
      "id": "node2",
      "address": "192.168.1.11",
      "port": 5678
    }
  ]
}
```

| Model Type | Endpoint | Examples |
|---|---|---|
| LLM | `/v1/chat/completions` | LLaMA-3.2-3B, LLaMA-3.1-8B, Phi-3, Qwen |
| Vision | `/v1/chat/completions` | LLaVA-1.5-7B (multimodal) |
| Text Generation | `/v1/completions` | Any LLM model supported by tinygrad |
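The usage examples later in this document target `/v1/chat/completions`; a `/v1/completions` request follows the same OpenAI convention. The model name and fields below are illustrative:

```bash
curl http://localhost:52415/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "prompt": "Distributed inference is",
    "max_tokens": 64
  }'
```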
The container supports running as a non-root user with configurable UID/GID:
```bash
docker run --net=host \
  -e PUID=1001 -e PGID=1001 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

This is especially important for security or when running in environments that enforce rootless containers. The entrypoint script uses `gosu` to drop privileges and run as the specified user.
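For context, the privilege drop typically follows the pattern sketched below; this is a minimal, hypothetical illustration, not the container's actual entrypoint script:

```bash
#!/bin/sh
# Hypothetical sketch of a gosu-based entrypoint: remap the runtime user, then drop root.
PUID=${PUID:-1000}
PGID=${PGID:-1000}
groupmod -o -g "$PGID" exo   # align the container group with the requested GID
usermod -o -u "$PUID" exo    # align the container user with the requested UID
chown -R exo:exo /data/exo   # make the data directories writable by that user
exec gosu exo "$@"           # re-exec the container command as the unprivileged user
```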
The container runs without the web UI by default, focusing only on the API. To enable the web UI:
```bash
docker run --net=host \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest \
  python -m exo.main --inference-engine tinygrad
```

A streaming chat completion request looks like this:

```bash
curl http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role": "user", "content": "What is distributed inference?"}],
"temperature": 0.7,
"stream": true
}'
```

The equivalent non-streaming request:

```bash
curl http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role": "user", "content": "What is distributed inference?"}],
"temperature": 0.7
}'
```

To optimize performance for your specific CPU, you can adjust the thread-count environment variables:
```bash
docker run --net=host \
  -e OMP_NUM_THREADS=8 \
  -e MKL_NUM_THREADS=8 \
  -e OPENBLAS_NUM_THREADS=8 \
  -e VECLIB_MAXIMUM_THREADS=8 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

Setting these values to match your CPU core count can improve performance. For best results:
- Set thread counts to match the number of available physical cores (see the snippet after this list for finding your core count)
- On machines with hyperthreading, test both the physical and logical core counts
- For large models, keep `DISABLE_COMPILER_CACHE=0` so the JIT compiler cache stays enabled
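On Linux, the core counts referenced in the list above can be read with `nproc` and `lscpu`:

```bash
nproc                                     # logical core count
lscpu | grep -E '^(Socket|Core|Thread)'   # sockets, cores per socket, threads per core
```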
tinygrad uses a just-in-time (JIT) compiler for CPU operations. By default, the container enables the compiler cache to improve performance. To disable it:
```bash
docker run --net=host \
  -e DISABLE_COMPILER_CACHE=1 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

If you're running in a Tailscale network, you can use the Tailscale discovery module:
```bash
docker run --net=host \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest \
  python -m exo.main --inference-engine tinygrad --discovery-module tailscale \
  --tailscale-api-key YOUR_API_KEY --tailnet-name YOUR_TAILNET
```

To specify a custom node ID for identification in the cluster:
```bash
docker run --net=host \
  -e NODE_ID=my-custom-node \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

To enable more verbose debugging, set the appropriate environment variables:
```bash
docker run --net=host \
  -e DEBUG=5 -e TINYGRAD_DEBUG=2 \
  -v /path/to/models:/data/exo/downloads \
  exo-cpu:latest
```

- `DEBUG` values (0-9): higher values show more exo debugging info
- `TINYGRAD_DEBUG` values (0-7): higher values show more tinygrad detail, including generated code:
  - `1`: lists devices being used
  - `2`: provides performance metrics for operations
  - `3`: outputs buffers used for each kernel
  - `4`: outputs the generated kernel code
  - `5`+: shows more detailed internal information
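Debug output goes to the container's stdout/stderr, so you can follow it with your runtime's log command (container name assumed):

```bash
docker logs -f exo-cpu
```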
Build the Docker image with:
```bash
docker build -t exo-cpu .
```

Or with Podman:
```bash
podman build -t exo-cpu .
```

The build process:
- Uses Python 3.12 slim Debian as the base image
- Installs necessary build dependencies and libraries
- Clones the exo repository
- Installs exo and its dependencies including tinygrad
- Sets up rootless user support (configurable via PUID/PGID)
- Configures the volumes and entry point for container execution
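After building, a quick sanity check that tinygrad imports cleanly inside the image (this mirrors the build-time verification mentioned above):

```bash
docker run --rm exo-cpu python -c "import tinygrad; print('tinygrad OK')"
```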