Fast, quantized inference for Hermes 3 using vLLM and bitsandbytes. Benchmarked on A100 and 4090 GPUs. Includes auto-eval scripts (AlpacaEval, LMSYS Arena).
- 4-bit quantization with bitsandbytes
- vLLM fast inference
- Benchmarks: 90 tok/s (A100), 28 tok/s (4090)
- Auto-eval with AlpacaEval or LMSYS Arena
Note: 4-bit quantization requires an NVIDIA GPU with CUDA drivers. If you are on a CPU-only machine or a Mac, run the quantization step on an A100, a 4090, or any other CUDA-capable machine.
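To confirm CUDA is actually visible before quantizing, a minimal check with PyTorch (assuming PyTorch is already installed):

```python
import torch

# 4-bit quantization via bitsandbytes needs a CUDA device.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; run the quantization step on a CUDA machine.")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
```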
```bash
# 1. Download Hermes 3 weights from Hugging Face
python scripts/download_model.py
```
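For reference, a minimal sketch of what a download step can look like with `huggingface_hub` (the repo id shown is an assumption; substitute the exact Hermes 3 variant you want):

```python
from huggingface_hub import snapshot_download

# Download all model files into the local Hugging Face cache.
# Repo id is illustrative; pick the Hermes 3 variant you actually need.
local_dir = snapshot_download(repo_id="NousResearch/Hermes-3-Llama-3.1-8B")
print(f"Weights downloaded to {local_dir}")
```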
```bash
# 2. Quantize to 4-bit (CUDA GPU required)
python scripts/quantize.py
```
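A hedged sketch of 4-bit loading with `transformers` + `bitsandbytes`; NF4 with bf16 compute is a common configuration, not necessarily the exact settings `scripts/quantize.py` uses:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"  # illustrative repo id

# NF4 4-bit quantization; weights are quantized on the fly at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```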
```bash
# 3. Run inference and log tokens/sec
python scripts/benchmark.py
```
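A rough sketch of how a tokens/sec measurement like `scripts/benchmark.py` could work with vLLM. The `quantization="bitsandbytes"` argument is an assumption about your vLLM install (bitsandbytes support varies by version), and the prompt is illustrative:

```python
import time
from vllm import LLM, SamplingParams

# In-flight bitsandbytes quantization; check your vLLM version's docs.
llm = LLM(model="NousResearch/Hermes-3-Llama-3.1-8B", quantization="bitsandbytes")

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain the difference between 4-bit and 8-bit quantization."]

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s")
```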
```bash
# 4. Auto-evaluate
python scripts/eval.py
```
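AlpacaEval consumes a JSON list of records with `instruction`, `output`, and `generator` fields; a minimal sketch of producing such a file (the file name, record contents, and generator tag are all illustrative, and `scripts/eval.py` may structure this differently):

```python
import json

# Each record pairs an eval instruction with the model's completion.
records = [
    {
        "instruction": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "generator": "hermes-3-4bit",  # illustrative model tag
    }
]

with open("model_outputs.json", "w") as f:
    json.dump(records, f, indent=2)
```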
Benchmarks below are from official Hermes 3 and vLLM community runs:

| GPU | Quantization | Throughput (tok/s, vLLM) |
|---|---|---|
| A100 | 4-bit | 90 |
| 4090 | 4-bit | 28 |
License: MIT