Open Source LLM Inference Engines

Overview of popular open source large language model inference engines. An inference engine is the program which loads a models weights and generates text responses based on given inputs.

Feel free to create a PR or issue if you want a new engine column, feature row, or update a status.

Compared Inference Engines

vLLM: Designed to provide SOTA throughput.
TensorRT-LLM: Nvidias design for a high performance extensible pytorch-like API for use with Nvidia Triton Inference Server.
llama.cpp: Pure C++ without any dependencies, with Apple Silicon prioritized.
TGI: HuggingFace' fast and flexible engine designed for high throughput.
LightLLM: Lightweight, fast and flexible framework targeting performance, written purely in Python / Triton.
DeepSpeed-MII / DeepSpeed-FastGen: Microsofts high performance implementation including SOTA Dynamic Splitfuse
ExLlamaV2: Efficiently run language models on modern consumer GPUs. Implements SOTA quantization method, EXL2.

Comparison Table

	vLLM	TensorRT-LLM	llama.cpp	TGI	LightLLM	Fastgen	ExLlamaV2
Optimizations
FlashAttention2	✅ ¹	✅ ²	🟠 ³	✅ ⁴	✅	✅	✅
PagedAttention	✅ ⁴	✅ ²	❌ ⁵	✅	🟠*** ⁶	✅	✅ ⁷
Speculative Decoding	🔨 ⁸	🗓️ ⁹	✅ ¹⁰	✅ ¹¹	❌	❌ ¹²	✅
Tensor Parallel	✅	✅ ¹³	🟠** ¹⁴	✅ ¹⁵	✅	✅ ¹⁶	❌
Pipeline Parallel	✅ ¹⁷	✅ ¹⁸	❌ ¹⁹	❓ ¹⁵	❌	❌ ²⁰	❌
Optim. / Scheduler
Dyn. SplitFuse (SOTA²¹)	🗓️ ²¹	🗓️ ²²	❌	❌	❌	✅ ²¹	❌
Efficient Rtr (better)	❌	❌	❌	❌	✅ ²³	❌	❌
Cont. Batching	✅ ²¹	✅ ²⁴	✅	✅	❌	✅ ¹⁶	❓ ²⁵
Optim. / Quant
EXL2 (SOTA²⁶)	🔨 ²⁷	❌	❌	✅ ²⁸	❌	❌	✅
AWQ	🌩️ ²⁹	✅	❌	✅	❌	❌	❌
Other Quants	(yes) ³⁰	GPTQ	GGUF ³¹	(yes) ³²	?	?	?
Features
OpenAI-Style API	✅	❌ ³³	✅ [^13]	✅ ³⁴	✅ ³⁵	❌	❌
Feat. / Sampling
Beam Search	✅	✅ ²	✅ ³⁶	🟠**** ³⁷	❌	❌ ³⁸	❌ ³⁹
JSON / Grammars via Outlines	✅	🗓️	✅	✅	?	?	✅
Models
Llama 2 / 3	✅	✅	✅	✅	✅	✅	✅
Mistral	✅	✅	✅	✅	✅ ⁴⁰	✅	✅
Mixtral	✅	✅	✅	✅	✅	✅	✅
Implementation
Core Language	Python	C++	C++	Py / Rust	Python	Python	Python
GPU API	CUDA*	CUDA*	Metal / CUDA	CUDA*	Triton / CUDA	CUDA*	CUDA
Repo
License	Apache 2	Apache 2	MIT	Apache 2 ⁴¹	Apache 2	Apache 2	MIT
Github Stars	17K	6K	54K	8K	2K	2K	3K

Benchmarks

BentoML (June 5th, 2024): Compares LMDeploy, MLC-LLM, TGI, TRT-LLM, vLLM

Notes

*Supports Triton for one-off such as FlashAttention (FusedAttention) / quantization, or allows Triton plugins, however the project doesn't use Triton otherwise.

**Sequentially processed tensor split

***"TokenAttention is the special case of PagedAttention when block size equals to 1, which we have tested before and find it under-utilizes GPU compute compared to larger block size. Unless LightLLM's Triton kernel implementation is surprisingly fast, this should not bring speedup."

****TGI maintainers suggest using best_of instead of beam search. (best_of creates n generations and selects the one with the lowest logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.

lapp0/lm-inference-engines

Open Source LLM Inference Engines

Compared Inference Engines

Comparison Table

Benchmarks

Notes

Footnotes