llm-inference-benchmark

LLM Inference benchmark

Inference frameworks

| Framework | Producibility**** | Docker Image | API Server | OpenAI API Server | WebUI | Multi Models** | Multi-node | Backends | Embedding Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| text-generation-webui | Low | Yes | Yes | Yes | Yes | No | No | Transformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformers | No |
| OpenLLM | High | Yes | Yes | Yes | No | With BentoML | With BentoML | Transformers (int8/int4/gptq), vLLM (awq/squeezellm), TensorRT | No |
| vLLM* | High | Yes | Yes | Yes | No | No | Yes (with Ray) | vLLM | No |
| Xinference | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/vLLM/TensorRT/GGML | Yes |
| TGI*** | Medium | Yes | Yes | No | No | No | No | Transformers/AutoGPTQ/AWQ/EETQ/vLLM/ExLlama/ExLlamaV2 | No |
| ScaleLLM | Medium | Yes | Yes | Yes | Yes | No | No | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | No |
| FastChat | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | Yes |

  • *vLLM/TGI can also serve as a backend.
  • **Multi Models: Capable of loading multiple models simultaneously.
  • ***TGI does not support chat mode; the chat prompt must be constructed manually.
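
Most frameworks above that expose an OpenAI API server can be exercised with the standard `openai` Python client by pointing it at the local endpoint. A minimal sketch, assuming a server is already running; the port, API key, and served model name are placeholders that depend on how the framework was launched:

```python
from openai import OpenAI

# Placeholder endpoint and credentials: adjust to the framework's actual
# host/port and served model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream the completion so the first token latency (FTL) is observable.
stream = client.chat.completions.create(
    model="Yi-6B-Chat",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```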

Inference backends

| Backend | Device | Compatibility** | PEFT Adapters* | Quantisation | Batching | Distributed Inference | Streaming |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transformers | GPU | High | Yes | bitsandbytes (int8/int4), AutoGPTQ (gptq), AutoAWQ (awq) | Yes | accelerate | Yes |
| vLLM | GPU | High | No | awq/squeezellm | Yes | Yes | Yes |
| ExLlamaV2 | GPU/CPU | Low | No | GPTQ | Yes | Yes | Yes |
| TensorRT | GPU | Medium | No | some models | Yes | Yes | Yes |
| Candle | GPU/CPU | Low | No | No | Yes | Yes | Yes |
| CTranslate2 | GPU | Low | No | Yes | Yes | Yes | Yes |
| TGI | GPU | Medium | Yes | awq/eetq/gptq/bitsandbytes | Yes | Yes | Yes |
| llama-cpp*** | GPU/CPU | High | No | GGUF/GPTQ | Yes | No | Yes |
| lmdeploy | GPU | Medium | No | AWQ | Yes | Yes | Yes |
| Deepspeed-FastGen | GPU | Low | No | No | Yes | Yes | Yes |

  • *PEFT Adapters: support for loading separate PEFT adapters (mostly LoRA); see the sketch after this list.
  • **Compatibility: High: Compatible with most models; Medium: Compatible with some models; Low: Compatible with few models.
  • ***llama.cpp's Python binding: llama-cpp-python.
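
As an illustration of the Quantisation and PEFT Adapters columns, here is a minimal sketch of the Transformers backend loading a model in 8-bit via bitsandbytes and attaching a separate LoRA adapter with peft. The adapter id is a hypothetical placeholder, and serving frameworks wire this up differently:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "01-ai/Yi-6B-Chat"       # the model benchmarked below
adapter_id = "your-org/your-lora"  # placeholder: a LoRA adapter trained for this base model

tokenizer = AutoTokenizer.from_pretrained(base_id)

# bitsandbytes int8 quantisation, as listed for the Transformers backend.
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Attach a separate PEFT (LoRA) adapter on top of the quantised base model.
model = PeftModel.from_pretrained(model, adapter_id)

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```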

Benchmark

Hardware:

  • GPU: 1x NVIDIA RTX4090 24GB
  • CPU: Intel Core i9-13900K
  • Memory: 96GB

Software:

  • VM: WSL2 on Windows 11
  • Guest OS: Ubuntu 22.04
  • NVIDIA Driver Version: 536.67
  • CUDA Version: 12.2
  • PyTorch: 2.1.1
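
A quick sanity check of this stack from inside the WSL2 guest, using only standard PyTorch introspection calls:

```python
import torch

# Confirm the PyTorch build, the CUDA runtime it was compiled against,
# and the GPU visible inside WSL2.
print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```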

Model:

  • Yi-6B-Chat

Data:

  • Prompt Length: 512, padded with random characters so requests cannot hit a prompt cache (see the sketch below).
  • Max Tokens: 200.
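
The exact request generator used for the runs is not shown here; the following is a minimal sketch of how such prompts can be built, assuming the length of 512 is measured in tokens. Each prompt is made of random characters so that no two requests share a cacheable prefix:

```python
import random
import string

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-6B-Chat")  # the benchmarked model

def make_prompt(target_tokens: int = 512) -> str:
    """Build a prompt of roughly `target_tokens` tokens made of random
    characters, so repeated requests cannot reuse a prompt/prefix cache."""
    noise = "".join(random.choices(string.ascii_letters + string.digits + " ",
                                   k=4 * target_tokens))
    ids = tokenizer(noise, add_special_tokens=False)["input_ids"][:target_tokens]
    return tokenizer.decode(ids)

prompt = make_prompt()
print(len(tokenizer(prompt, add_special_tokens=False)["input_ids"]))  # roughly 512
```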

Backend Benchmark

No Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| text-generation-webui Transformers | 40.39 | 0.15 | 41.47 | 0.21 | 344.61 |
| text-generation-webui Transformers with flash-attention-2 | 58.30 | 0.21 | 43.52 | 0.21 | 341.39 |
| text-generation-webui ExLlamaV2 | 69.09 | 0.26 | 50.71 | 0.27 | 564.80 |
| OpenLLM PyTorch | 60.79 | 0.22 | 44.73 | 0.21 | 514.55 |
| TGI | 192.58 | 0.90 | 59.68 | 0.28 | 82.72 |
| vLLM | 222.63 | 1.08 | 62.69 | 0.30 | 95.43 |
| TensorRT | - | - | - | - | - |
| CTranslate2* | - | - | - | - | - |
| lmdeploy | 236.03 | 1.15 | 67.86 | 0.33 | 76.81 |

  • @N: batch size (bs). For example, TPS@4 is measured with a batch size of 4 and TPS@1 with a single request; see the sketch after this list for how the metrics are derived.

  • TPS: Tokens Per Second.

  • QPS: Queries Per Second.

  • FTL: First Token Latency, measured in milliseconds. Applicable only in stream mode.

  • *CTranslate2: encountered an error converting Yi-6B-Chat. See details in the issue.
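
Based on the definitions above, a hedged sketch of how TPS, QPS, and FTL can be derived from per-request timing records; the benchmark's actual measurement code may differ, and the two records at the bottom are fabricated only to show the shape of the calculation:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    start: float        # wall-clock time the request was sent (seconds)
    first_token: float  # wall-clock time the first streamed token arrived
    end: float          # wall-clock time the last token arrived
    tokens: int         # number of generated tokens

def summarize(records: list[RequestRecord]) -> dict:
    """Aggregate one benchmark run into TPS, QPS, and mean FTL."""
    wall = max(r.end for r in records) - min(r.start for r in records)
    return {
        "TPS": sum(r.tokens for r in records) / wall,   # tokens per second
        "QPS": len(records) / wall,                     # queries per second
        "FTL_ms": 1000 * sum(r.first_token - r.start    # mean first token latency (ms)
                             for r in records) / len(records),
    }

# Two fabricated records, purely to illustrate the calculation.
print(summarize([
    RequestRecord(start=0.0, first_token=0.08, end=3.2, tokens=200),
    RequestRecord(start=0.1, first_token=0.19, end=3.4, tokens=200),
]))
```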

8-bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| TGI EETQ 8-bit | 293.08 | 1.41 | 88.08 | 0.42 | 63.69 |
| TGI GPTQ 8-bit | - | - | - | - | - |
| OpenLLM PyTorch AutoGPTQ 8-bit | 49.8 | 0.17 | 29.54 | 0.14 | 930.16 |

  • bitsandbytes is very slow (about 6.8 tokens/s at int8), so it is not benchmarked.
  • EETQ 8-bit does not require a specially prepared (pre-quantised) model.
  • TGI GPTQ 8-bit failed to load: Server error: module 'triton.compiler' has no attribute 'OutOfResources'
    • TGI GPTQ uses either the exllama or the triton backend.

4-bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| TGI AWQ 4-bit | 336.47 | 1.61 | 102.00 | 0.48 | 94.84 |
| vLLM AWQ 4-bit | 29.03 | 0.14 | 37.48 | 0.19 | 3711.0 |
| text-generation-webui llama-cpp GGUF 4-bit | 67.63 | 0.37 | 56.65 | 0.34 | 331.57 |