Fast, quantized inference for Hermes 3 using vLLM and bitsandbytes. Benchmarked on A100 and 4090 GPUs. Includes auto-eval scripts (AlpacaEval, LMSYS Arena).
- 4-bit quantization with bitsandbytes
- vLLM fast inference
- Benchmarks: 90 tok/s (A100), 28 tok/s (4090)
- Auto-eval with AlpacaEval or LMSYS Arena
Note: 4-bit quantization requires an NVIDIA GPU with CUDA drivers. If you are on a CPU-only machine or a Mac, run the quantization step on an A100, a 4090, or any other CUDA-capable machine.
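To confirm CUDA is actually visible before quantizing, a minimal check with PyTorch (assuming PyTorch is already installed):

```python
import torch

# 4-bit quantization via bitsandbytes needs a CUDA device.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; run the quantization step on a CUDA machine.")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
```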
```bash
# 1. Download Hermes 3 weights from Hugging Face
python scripts/download_model.py
```
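For reference, a minimal sketch of what a download step can look like with `huggingface_hub` (the repo id shown is an assumption; substitute the exact Hermes 3 variant you want):

```python
from huggingface_hub import snapshot_download

# Download all model files into the local Hugging Face cache.
# Repo id is illustrative; pick the Hermes 3 variant you actually need.
local_dir = snapshot_download(repo_id="NousResearch/Hermes-3-Llama-3.1-8B")
print(f"Weights downloaded to {local_dir}")
```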
```bash
# 2. Quantize to 4-bit (CUDA GPU required)
python scripts/quantize.py
```
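A hedged sketch of 4-bit loading with `transformers` + `bitsandbytes`; NF4 with bf16 compute is a common configuration, not necessarily the exact settings `scripts/quantize.py` uses:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"  # illustrative repo id

# NF4 4-bit quantization; weights are quantized on the fly at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```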
```bash
# 3. Run inference and log tokens/sec
python scripts/benchmark.py
```
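A rough sketch of how a tokens/sec measurement like `scripts/benchmark.py` could work with vLLM. The `quantization="bitsandbytes"` argument is an assumption about your vLLM install (bitsandbytes support varies by version), and the prompt is illustrative:

```python
import time
from vllm import LLM, SamplingParams

# In-flight bitsandbytes quantization; check your vLLM version's docs.
llm = LLM(model="NousResearch/Hermes-3-Llama-3.1-8B", quantization="bitsandbytes")

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain the difference between 4-bit and 8-bit quantization."]

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s")
```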
```bash
# 4. Auto-evaluate
python scripts/eval.py
```
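AlpacaEval consumes a JSON list of records with `instruction`, `output`, and `generator` fields; a minimal sketch of producing such a file (the file name, record contents, and generator tag are all illustrative, and `scripts/eval.py` may structure this differently):

```python
import json

# Each record pairs an eval instruction with the model's completion.
records = [
    {
        "instruction": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "generator": "hermes-3-4bit",  # illustrative model tag
    }
]

with open("model_outputs.json", "w") as f:
    json.dump(records, f, indent=2)
```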
Benchmarks below are from official Hermes 3 and vLLM community runs:

| GPU | Quantization | Throughput (tok/s, vLLM) |
|---|---|---|
| A100 | 4-bit | 90 |
| 4090 | 4-bit | 28 |
License: MIT