
Native, Apple Silicon–only local LLM server. Built on Apple's MLX for maximum performance on M‑series chips. SwiftUI app + SwiftNIO server with OpenAI‑compatible and Ollama‑compatible endpoints.
Created by Dinoki Labs (dinoki.ai), makers of a fully native desktop AI assistant and companion.
- Native MLX runtime: Optimized for Apple Silicon using MLX/MLXLLM
- Apple Silicon only: Designed and tested for M‑series Macs
- OpenAI API compatible: `/v1/models` and `/v1/chat/completions` (stream and non-stream)
- Ollama-compatible: `/chat` endpoint with NDJSON streaming for OllamaKit and other Ollama clients
- Function/Tool calling: OpenAI-style `tools` + `tool_choice`, with `tool_calls` parsing and streaming deltas
- Fast token streaming: Server-Sent Events for low-latency output
- Model manager UI: Browse, download, and manage MLX models from `mlx-community`
- System resource monitor: Real-time CPU and RAM usage visualization
- Self‑contained: SwiftUI app with an embedded SwiftNIO HTTP server
- macOS 15.5+
- Apple Silicon (M1 or newer)
- Xcode 16.4+ (to build from source)
osaurus/
├── Core/
│ ├── AppDelegate.swift
│ └── osaurusApp.swift
├── Controllers/
│ ├── ServerController.swift # NIO server lifecycle
│ └── ModelManager.swift # Model discovery & downloads (Hugging Face)
├── Models/
│ ├── MLXModel.swift
│ ├── OpenAIAPI.swift # OpenAI‑compatible DTOs
│ ├── ResponseWriters.swift # SSE and NDJSON response writers
│ ├── ServerConfiguration.swift
│ └── ServerHealth.swift
├── Networking/
│ ├── HTTPHandler.swift # Request parsing & routing entry
│ ├── Router.swift # Routes → handlers with path normalization
│ └── AsyncHTTPHandler.swift # Unified streaming handler
├── Services/
│ ├── MLXService.swift # MLX loading, session caching, generation
│ ├── SearchService.swift
│ └── SystemMonitorService.swift # Real-time CPU and RAM monitoring
├── Theme/
│ └── Theme.swift
├── Views/
│ ├── Components/SimpleComponents.swift
│ ├── ContentView.swift # Start/stop server, quick controls
│ └── ModelDownloadView.swift # Browse/download/manage models
└── Assets.xcassets/
- Native MLX text generation with model and session caching
- Model manager with curated suggestions (Llama, Qwen, Gemma, Mistral, etc.)
- Download sizes estimated via Hugging Face metadata
- Streaming and non‑streaming chat completions
- Multiple response formats: SSE (OpenAI‑style) and NDJSON (Ollama‑style)
- Compatible with OllamaKit and other Ollama client libraries
- OpenAI‑compatible function calling with robust parser for model outputs (handles code fences/formatting noise)
- Auto‑detects stop sequences and BOS token from tokenizer configs
- Health endpoint and simple status UI
- Real-time system resource monitoring
- Path normalization for API compatibility
The following are 20-run averages from our batch benchmark suite. See raw results for details and variance.
| Server | Model | TTFT avg (ms) | Total avg (ms) | Chars/s avg | TTFT rel | Total rel | Chars/s rel | Success |
|---|---|---|---|---|---|---|---|---|
| Osaurus | llama-3.2-3b-instruct-4bit | 86 | 1314 | 558 | 0% | 0% | 0% | 100% |
| Ollama | llama3.2 | 58 | 1655 | 434 | +32% | -26% | -22% | 100% |
| LM Studio | llama-3.2-3b-instruct | 56 | 1203 | 610 | +34% | +8% | +9% | 100% |
- Metrics: TTFT = time to first token, Total = time to final token, Chars/s = streaming throughput.
- Relative % vs the Osaurus baseline: TTFT/Total computed as 1 - other/osaurus; Chars/s as other/osaurus - 1. Positive = better. (See the sketch after this list.)
- Data sources: `results/osaurus-vs-ollama-lmstudio-batch.summary.json`, `results/osaurus-vs-ollama-lmstudio-batch.results.csv`.
- How to reproduce: `scripts/run_bench.sh` calls `scripts/benchmark_models.py` to run prompts across servers and write results.
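For clarity, here is a small Python sketch of how those relative percentages are derived. The published figures come from unrounded averages, so feeding in the rounded table values can shift a result by about a percentage point:

```python
def rel_latency(osaurus_ms: float, other_ms: float) -> float:
    """Lower is better, so a positive value means the other server was faster than Osaurus."""
    return 1 - other_ms / osaurus_ms

def rel_throughput(osaurus_cps: float, other_cps: float) -> float:
    """Higher is better, so a positive value means the other server streamed faster."""
    return other_cps / osaurus_cps - 1

# Ollama row vs. the Osaurus baseline from the table above
print(f"TTFT rel:  {rel_latency(86, 58):+.0%}")      # ~ +33% with rounded inputs (+32% in the table)
print(f"Total rel: {rel_latency(1314, 1655):+.0%}")  # -26%
print(f"Chars/s:   {rel_throughput(558, 434):+.0%}") # -22%
```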
- `GET /` → Plain text status
- `GET /health` → JSON health info
- `GET /models` → OpenAI-compatible models list
- `GET /tags` → Ollama-compatible models list
- `POST /chat/completions` → OpenAI-compatible chat completions
- `POST /chat` → Ollama-compatible chat endpoint

Path normalization: All endpoints support common API prefixes (`/v1`, `/api`, `/v1/api`); a minimal sketch of the mapping follows the examples below. For example:

- `/v1/models` → `/models`
- `/api/chat/completions` → `/chat/completions`
- `/api/chat` → `/chat` (Ollama-style)
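The mapping amounts to stripping a recognized prefix before routing. A minimal Python sketch of the idea, illustrative only (the real logic lives in Router.swift and may differ):

```python
def normalize_path(path: str) -> str:
    """Strip a recognized API prefix so /v1/models, /api/models, etc. route identically."""
    for prefix in ("/v1/api", "/api", "/v1"):  # longest prefix checked first
        if path == prefix or path.startswith(prefix + "/"):
            return path[len(prefix):] or "/"
    return path

assert normalize_path("/v1/models") == "/models"
assert normalize_path("/api/chat/completions") == "/chat/completions"
assert normalize_path("/api/chat") == "/chat"
```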
Download the latest signed build from the Releases page.
- Open `osaurus.xcodeproj` in Xcode 16.4+
- Build and run the `osaurus` target
- In the UI, configure the port via the gear icon (default `8080`) and press Start
- Open the model manager to download a model (e.g., "Llama 3.2 3B Instruct 4bit")
Models are stored by default at `~/MLXModels`. Override this with the environment variable `OSU_MODELS_DIR`.
Base URL: `http://127.0.0.1:8080` (or your chosen port)
List models:
curl -s http://127.0.0.1:8080/v1/models | jq
Ollama‑compatible models list:
curl -s http://127.0.0.1:8080/v1/tags | jq
Non‑streaming chat completion:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Write a haiku about dinosaurs"}],
"max_tokens": 200
}'
Streaming chat completion (SSE format for `/chat/completions`):
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Summarize Jurassic Park in one paragraph"}],
"stream": true
}'
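If you are not using an SDK, the SSE stream can also be consumed directly. A minimal sketch using the `requests` library, assuming standard OpenAI-style `data:` lines terminated by `[DONE]`:

```python
import json
import requests

with requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role": "user", "content": "Summarize Jurassic Park in one paragraph"}],
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # SSE events arrive as lines prefixed with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)
print()
```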
Ollama-compatible streaming (NDJSON format for `/chat`):
curl -N http://127.0.0.1:8080/v1/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Tell me about dinosaurs"}],
"stream": true
}'
This endpoint is compatible with OllamaKit and other Ollama client libraries.
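Without a client library, the NDJSON stream is just one JSON object per line. A minimal Python sketch, assuming each line carries an Ollama-style `message` object with a `content` field and a `done` flag:

```python
import json
import requests

with requests.post(
    "http://127.0.0.1:8080/v1/api/chat",
    json={
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role": "user", "content": "Tell me about dinosaurs"}],
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)  # one NDJSON object per line
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()
```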
Tip: Model names are lower-cased with hyphens (derived from the friendly name), for example: Llama 3.2 3B Instruct 4bit → `llama-3.2-3b-instruct-4bit`.
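As an illustration of that convention (a hypothetical snippet; the app derives the identifier internally):

```python
friendly_name = "Llama 3.2 3B Instruct 4bit"
model_id = friendly_name.lower().replace(" ", "-")
print(model_id)  # llama-3.2-3b-instruct-4bit
```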
If you're building a macOS app (Swift/Objective‑C/SwiftUI/Electron) and want to discover and connect to a running Osaurus instance, see the Shared Configuration guide: SHARED_CONFIGURATION_GUIDE.md.
Osaurus supports OpenAI-style function calling. Send `tools` and an optional `tool_choice` in your request. The model is instructed to reply with an exact JSON object containing `tool_calls`, and the server parses it, tolerating common formatting noise such as code fences.
Define tools and let the model decide (`tool_choice: "auto"`):
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [
{"role":"system","content":"You can call functions to answer queries succinctly."},
{"role":"user","content":"What\'s the weather in SF?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather by city name",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}
],
"tool_choice": "auto"
}'
Non-streaming responses include `message.tool_calls` and `finish_reason: "tool_calls"`. Streaming responses emit OpenAI-style deltas for `tool_calls` (id, type, function name, and chunked `arguments`), finishing with `finish_reason: "tool_calls"` and `[DONE]`.
After you execute a tool, continue the conversation by sending a `tool` role message with `tool_call_id`:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [
{"role":"user","content":"What\'s the weather in SF?"},
{"role":"assistant","content":"","tool_calls":[{"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"SF\"}"}}]},
{"role":"tool","tool_call_id":"call_1","content":"{\"tempC\":18,\"conditions\":\"Foggy\"}"}
]
}'
Notes:
- Only `type: "function"` tools are supported.
- Arguments must be a JSON-escaped string in the assistant response; Osaurus also tolerates a nested `parameters` object and will normalize it.
- The parser accepts minor formatting noise such as code fences and `assistant:` prefixes.
Point your client at Osaurus and use any placeholder API key.
Python example:
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Hello there!"}],
)
print(resp.choices[0].message.content)
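Python streaming example (a minimal sketch, reusing the `client` from above; the SDK yields chunks whose `choices[0].delta.content` carries the incremental text):

```python
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a fun dinosaur fact"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```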
Python with tools (non‑stream):
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather by city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Weather in SF?"}],
    tools=tools,
    tool_choice="auto",
)

tool_calls = resp.choices[0].message.tool_calls or []
for call in tool_calls:
    args = json.loads(call.function.arguments)
    result = {"tempC": 18, "conditions": "Foggy"}  # your tool result
    followup = client.chat.completions.create(
        model="llama-3.2-3b-instruct-4bit",
        messages=[
            {"role": "user", "content": "Weather in SF?"},
            {"role": "assistant", "content": "", "tool_calls": tool_calls},
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
    )
    print(followup.choices[0].message.content)
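Streaming with tools works the same way: the deltas described earlier can be accumulated per call index. A hedged sketch, reusing `client` and `tools` from above and assuming the standard OpenAI delta layout (`index`, `id`, `function.name`, chunked `function.arguments`):

```python
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Weather in SF?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

calls = {}  # index -> accumulated tool call
for chunk in stream:
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        entry = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
        if tc.id:
            entry["id"] = tc.id
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments

for entry in calls.values():
    print(entry["name"], json.loads(entry["arguments"]))
```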
Osaurus includes built‑in CORS support for browser clients.
- Disabled by default: No CORS headers are sent unless you configure allowed origins.
- Enable via UI: gear icon → Advanced Settings → CORS Settings → Allowed Origins.
  - Enter a comma-separated list, for example: `http://localhost:3000, http://127.0.0.1:5173, https://app.example.com`
  - Use `*` to allow any origin (recommended only for local development).
- Expose to network: If you need to access Osaurus from other devices, also enable "Expose to network" in Network Settings.
Behavior when CORS is enabled:
- Requests with an allowed `Origin` receive `Access-Control-Allow-Origin` (either the specific origin or `*`).
- Preflight `OPTIONS` requests are answered with `204 No Content` and the headers:
  - `Access-Control-Allow-Methods`: echoes the requested method or defaults to `GET, POST, OPTIONS, HEAD`
  - `Access-Control-Allow-Headers`: echoes the requested headers or defaults to `Content-Type, Authorization`
  - `Access-Control-Max-Age: 600`
- Streaming endpoints also include CORS headers on their responses.
Quick examples
Configure via UI (persists to app settings). The underlying config includes:
{
"allowedOrigins": ["http://localhost:3000", "https://app.example.com"]
}
Browser fetch from a web app running on `http://localhost:3000`:
await fetch("http://127.0.0.1:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama-3.2-3b-instruct-4bit",
    messages: [{ role: "user", content: "Hello!" }],
  }),
});
Notes
- Leave the field empty to disable CORS entirely.
- `*` cannot be combined with credentials; Osaurus does not use cookies, so this is typically fine for local use.
- Curated suggestions include Llama, Qwen, Gemma, Mistral, Phi, DeepSeek, etc. (4‑bit variants for speed)
- Discovery pulls from Hugging Face `mlx-community` and computes size estimates
- Required files are fetched automatically (tokenizer/config/weights)
- Change the models directory with `OSU_MODELS_DIR`
- Apple Silicon only (requires MLX); Intel Macs are not supported
- Localhost only, no authentication; put behind a proxy if exposing externally
- `/transcribe` endpoints are placeholders pending Whisper integration
- SwiftNIO (HTTP server)
- SwiftUI/AppKit (UI)
- MLX‑Swift, MLXLLM (runtime and generation)
- wizardeur — first PR creator
- Join us on Discord
- Read the Contributing Guide and our Code of Conduct
- See our Security Policy for reporting vulnerabilities
- Get help in Support
- Pick up a good first issue or help wanted
If you find Osaurus useful, please ⭐ the repo and share it!