hermes3-fast-inference

Fast, quantized inference for Hermes 3 using vLLM and bitsandbytes. Benchmarked on NVIDIA A100 and RTX 4090 GPUs. Includes auto-eval scripts (AlpacaEval, LMSYS Arena).

Features

  • 4-bit quantization with bitsandbytes
  • Fast inference serving with vLLM
  • Benchmarks: 90 tok/s (A100), 28 tok/s (4090)
  • Auto-eval with AlpacaEval or LMSYS Arena

Quickstart

Note: 4-bit quantization requires an NVIDIA GPU with CUDA drivers. If you are on a CPU-only machine or a Mac, run the quantization step on a CUDA-enabled machine such as an A100 or 4090.

# 1. Download Hermes 3 weights from Hugging Face
python scripts/download_model.py
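
scripts/download_model.py is not reproduced here; a minimal sketch of what it might do, assuming the NousResearch/Hermes-3-Llama-3.1-8B repo on Hugging Face and a models/hermes3 output directory (both illustrative):

from huggingface_hub import snapshot_download

# Fetch the full model snapshot from the Hub.
# Repo id and local dir are assumptions, not read from the script.
snapshot_download(
    repo_id="NousResearch/Hermes-3-Llama-3.1-8B",
    local_dir="models/hermes3",
)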

# 2. Quantize to 4-bit (CUDA GPU required)
python scripts/quantize.py
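
A minimal sketch of the quantization step, assuming transformers with a bitsandbytes NF4 config; the paths, quant type, and compute dtype are illustrative, not read from scripts/quantize.py:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",              # assumed quant type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "models/hermes3",                       # assumed local path from step 1
    quantization_config=bnb_config,
    device_map="auto",                      # requires a CUDA GPU
)
model.save_pretrained("models/hermes3-4bit")  # assumed output path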

# 3. Run inference and log tokens/sec
python scripts/benchmark.py
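
A rough sketch of how tokens/sec can be measured, assuming a vLLM build with bitsandbytes support; the model path and prompt are placeholders, not read from scripts/benchmark.py:

import time
from vllm import LLM, SamplingParams

# Assumed path to the 4-bit checkpoint produced in step 2.
llm = LLM(model="models/hermes3-4bit", quantization="bitsandbytes")
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s")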

# 4. Auto-evaluate
python scripts/eval.py
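
A sketch of the AlpacaEval path, assuming model outputs are written in the JSON record format the alpaca_eval CLI consumes; the prompts, paths, and generator name below are placeholders, not read from scripts/eval.py:

import json
from vllm import LLM, SamplingParams

llm = LLM(model="models/hermes3-4bit", quantization="bitsandbytes")  # assumed path
params = SamplingParams(max_tokens=512, temperature=0.7)

instructions = ["What is the capital of France?"]  # placeholder eval prompts
outputs = llm.generate(instructions, params)

# AlpacaEval expects a JSON list of {"instruction", "output", "generator"} records.
records = [
    {"instruction": ins, "output": out.outputs[0].text, "generator": "hermes3-4bit"}
    for ins, out in zip(instructions, outputs)
]
with open("model_outputs.json", "w") as f:
    json.dump(records, f)

# Then score with the alpaca_eval CLI, e.g.:
#   alpaca_eval --model_outputs model_outputs.json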

Benchmarks

The benchmarks below are from official Hermes 3 and vLLM community runs:

GPU    Quantization    vLLM throughput (tok/s)
A100   4-bit           90
4090   4-bit           28

License

MIT