Inferflow is an efficient and highly configurable inference engine and service for large language models (LLMs). With Inferflow, users can serve most of the common transformer models by simply modifying some lines in corresponding configuration files, without writing a single line of source code.
- Extensible and highly configurable: The typical way to serve a new model with Inferflow is to edit a model specification file, not to add or edit source code. Inferflow implements a modular framework of atomic building blocks and technologies, making it compositionally generalizable to new models. A new model can be served by Inferflow if its atomic building blocks and technologies are already "known" to Inferflow.
- 3.5-bit quantization: Inferflow implements 2-bit, 3-bit, 3.5-bit, 4-bit, 5-bit, 6-bit, and 8-bit quantization. Among these schemes, 3.5-bit quantization is newly introduced by Inferflow.
- Hybrid model partitioning for multi-GPU inference: Inferflow supports multi-GPU inference with three model partitioning strategies to choose from: partition-by-layer (pipeline parallelism), partition-by-tensor (tensor parallelism), and hybrid partitioning (hybrid parallelism). Hybrid partitioning is seldom supported by other inference engines.
- Wide file format support (and safe loading of pickle data): Inferflow supports loading models in multiple file formats directly, without relying on an external converter. Supported formats include pickle, safetensors, llama.cpp gguf, etc. Reading pickle files with Python code is known to pose security risks. By implementing a simplified pickle parser in C++, Inferflow loads models from pickle data safely (see the sketch after this list).
- Wide network type support: Three types of transformer models are supported: decoder-only models, encoder-only models, and encoder-decoder models.
- GPU/CPU hybrid inference: GPU-only, CPU-only, and GPU/CPU hybrid inference are all supported.
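As a concrete illustration of the pickle-safety point above: Python's pickle.load() is unsafe on untrusted files because opcodes such as GLOBAL and REDUCE make the unpickler import modules and call arbitrary callables. A native parser can instead just decode the opcode stream and record what it sees. The sketch below shows this idea for a handful of opcodes; it is a minimal illustration only, not Inferflow's actual parser, and a real loader must handle many more opcodes and perform stricter bounds checking.

```cpp
// Minimal sketch of scanning a pickle opcode stream in C++ without executing
// anything. Illustration only; NOT Inferflow's actual parser.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

bool ScanPickle(const std::vector<uint8_t> &buf) {
    size_t pos = 0;
    // Read a little-endian uint32 argument.
    auto read_u32 = [&](uint32_t &v) {
        if (pos + 4 > buf.size()) return false;
        v = buf[pos] | (buf[pos + 1] << 8) | (buf[pos + 2] << 16)
            | ((uint32_t)buf[pos + 3] << 24);
        pos += 4;
        return true;
    };
    // Read a newline-terminated text argument (used by GLOBAL).
    auto read_line = [&](std::string &s) {
        s.clear();
        while (pos < buf.size() && buf[pos] != '\n') s.push_back((char)buf[pos++]);
        if (pos >= buf.size()) return false;
        ++pos;  // skip '\n'
        return true;
    };

    while (pos < buf.size()) {
        uint8_t op = buf[pos++];
        switch (op) {
            case 0x80: ++pos; break;                        // PROTO <version>
            case '(': case ')': case ']': case '}': break;  // MARK, empty containers
            case 'N': break;                                // NONE
            case 'q': ++pos; break;                         // BINPUT <1-byte memo id>
            case 'r': pos += 4; break;                      // LONG_BINPUT <4-byte memo id>
            case 'K': ++pos; break;                         // BININT1
            case 'M': pos += 2; break;                      // BININT2
            case 'J': pos += 4; break;                      // BININT
            case 'X': {                                     // BINUNICODE <u32 len><utf-8>
                uint32_t len = 0;
                if (!read_u32(len) || pos + len > buf.size()) return false;
                printf("string: %.*s\n", (int)len, (const char *)buf.data() + pos);
                pos += len;
                break;
            }
            case 'c': {                                     // GLOBAL <module>\n<name>\n
                std::string module, name;
                if (!read_line(module) || !read_line(name)) return false;
                // Python's unpickler would import this and later call it via
                // REDUCE; a native parser only records the reference.
                printf("global: %s %s\n", module.c_str(), name.c_str());
                break;
            }
            case 'R': case 'Q': break;                      // REDUCE / BINPERSID: nothing is called
            case '.': return true;                          // STOP
            default:
                fprintf(stderr, "unhandled opcode 0x%02x\n", op);
                return false;                               // a real parser supports many more
        }
    }
    return false;  // ran out of data before STOP
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file.pkl>\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (fp == nullptr) { perror("fopen"); return 1; }
    std::vector<uint8_t> buf;
    uint8_t chunk[4096];
    size_t n = 0;
    while ((n = fread(chunk, 1, sizeof(chunk), fp)) > 0) buf.insert(buf.end(), chunk, chunk + n);
    fclose(fp);
    return ScanPickle(buf) ? 0 : 2;
}
```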
Below is a comparison between Inferflow and some other inference engines:
Engine | New Model Support | Supported File Formats | Network Structures | Quantization Bits | Hybrid Parallelism for Multi-GPU Inference | Programming Languages |
---|---|---|---|---|---|---|
Huggingface Transformers | Adding/editing source code | pickle (unsafe), safetensors | decoder-only, encoder-decoder, encoder-only | 4b, 8b | ✘ | Python |
vLLM | Adding/editing source code | pickle (unsafe), safetensors | decoder-only | 4b, 8b | ✘ | Python |
TensorRT-LLM | Adding/editing source code | | decoder-only, encoder-decoder, encoder-only | 4b, 8b | ✘ | C++, Python |
DeepSpeed-MII | Adding/editing source code | pickle (unsafe), safetensors | decoder-only | - | ✘ | Python |
llama.cpp | Adding/editing source code | gguf | decoder-only | 2b, 3b, 4b, 5b, 6b, 8b | ✘ | C/C++ |
llama2.c | Adding/editing source code | llama2.c | decoder-only | - | ✘ | C |
LMDeploy | Adding/editing source code | pickle (unsafe), TurboMind | decoder-only | 4b, 8b | ✘ | C++, Python |
Inferflow | Editing configuration files | pickle (safe), safetensors, gguf, llama2.c | decoder-only, encoder-decoder, encoder-only | 2b, 3b, 3.5b, 4b, 5b, 6b, 8b | ✔ | C++ |
Supported file formats:
- Pickle (Inferflow reduces the security risk that most other inference engines face when loading pickle-format files)
- Safetensors
- llama.cpp gguf
- llama2.c
Supported modules and technologies related to model definition:
- Normalization functions: STD, RMS (a reference sketch of RMS normalization follows this list)
- Activation functions: RELU, GELU, SILU
- Position embeddings: ALIBI, RoPE, Sinusoidal
- Grouped-query attention
- Parallel attention
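For reference, the RMS variant of normalization listed above can be written at the formula level as y_i = x_i * g_i / sqrt(mean(x^2) + eps). The short function below is only a plain restatement of that formula for clarity, not Inferflow's kernel; the epsilon value and the weight name gamma are illustrative assumptions.

```cpp
// RMS normalization, written as a formula-level reference, not Inferflow's code.
#include <cmath>
#include <vector>

void RmsNorm(const std::vector<float> &x, const std::vector<float> &gamma,
             std::vector<float> &y, float eps = 1e-5f) {
    // gamma is the learned per-element scale and must have the same size as x.
    float mean_sq = 0.0f;
    for (float v : x) mean_sq += v * v;
    mean_sq /= (float)x.size();
    float inv_rms = 1.0f / std::sqrt(mean_sq + eps);
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * inv_rms * gamma[i];
    }
}
```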
Supported technologies and options related to serving:
- Linear quantization of weights and KV cache elements: 2-bit, 3-bit, 3.5-bit, 4-bit, 5-bit, 6-bit, and 8-bit (a linear-quantization sketch follows this list)
- The option of moving part or all of the KV cache from VRAM to regular RAM
- The option of placing the input embedding tensor(s) in regular RAM
- Model partitioning strategies for multi-GPU inference: partition-by-layer, partition-by-tensor, hybrid partitioning
- Dynamic batching
- Decoding strategies: Greedy, top-k, top-p, FSD, typical, mirostat, etc. (a top-p sampling sketch follows this list)
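To make the linear-quantization bullet above more concrete, the sketch below quantizes one block of weights to n evenly spaced levels with a per-block scale and minimum, and dequantizes as code * scale + min. This is a generic illustration, not Inferflow's actual kernels, block sizes, or packed storage layout. The 3.5-bit packing helper at the end (two codes sharing 7 bits, e.g. 11 levels each since 11 * 11 = 121 <= 128) is likewise an illustrative assumption about how a fractional bit width can be realized, not a description of Inferflow's exact encoding.

```cpp
// Generic linear (affine) block quantization: an illustration only, not
// Inferflow's implementation or storage format.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantBlock {
    float scale = 1.0f;          // step between adjacent quantization levels
    float min = 0.0f;            // value represented by code 0
    std::vector<uint8_t> codes;  // one code per weight (bit packing omitted)
};

// Quantize a block of weights to `levels` evenly spaced values.
QuantBlock QuantizeLinear(const std::vector<float> &w, int levels) {
    QuantBlock blk;
    float lo = w[0], hi = w[0];
    for (float x : w) { lo = std::min(lo, x); hi = std::max(hi, x); }
    blk.min = lo;
    blk.scale = (hi > lo) ? (hi - lo) / (float)(levels - 1) : 1.0f;
    blk.codes.reserve(w.size());
    for (float x : w) {
        int q = (int)std::lround((x - lo) / blk.scale);
        q = std::max(0, std::min(q, levels - 1));
        blk.codes.push_back((uint8_t)q);
    }
    return blk;
}

// Dequantize one element back to float.
float Dequantize(const QuantBlock &blk, size_t i) {
    return blk.codes[i] * blk.scale + blk.min;
}

// Hypothetical 3.5-bit packing for illustration: with 11 levels per weight,
// two codes fit into one 7-bit value because 11 * 11 = 121 <= 128.
uint8_t PackTwoCodes11(uint8_t a, uint8_t b) {
    return (uint8_t)(a * 11 + b);  // decode: a = packed / 11, b = packed % 11
}
```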
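And for the decoding-strategies bullet, here is a minimal top-p (nucleus) sampler over an already-normalized probability vector. It only illustrates what that strategy does; it is not Inferflow's sampler, and the other listed strategies (greedy, top-k, FSD, typical, mirostat) work differently.

```cpp
// Minimal top-p (nucleus) sampling over a probability distribution.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int SampleTopP(const std::vector<float> &probs, float top_p, std::mt19937 &rng) {
    // Sort token ids by probability, descending.
    std::vector<int> ids(probs.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::sort(ids.begin(), ids.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    // Keep the smallest prefix whose cumulative probability reaches top_p.
    float cum = 0.0f;
    size_t cutoff = ids.size();
    for (size_t i = 0; i < ids.size(); ++i) {
        cum += probs[ids[i]];
        if (cum >= top_p) { cutoff = i + 1; break; }
    }

    // Renormalize within the nucleus and draw one token.
    std::uniform_real_distribution<float> dist(0.0f, cum);
    float r = dist(rng), acc = 0.0f;
    for (size_t i = 0; i < cutoff; ++i) {
        acc += probs[ids[i]];
        if (r <= acc) return ids[i];
    }
    return ids[cutoff - 1];
}
```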
Supported network structures:
- Decoder-only: Inferflow supports many types of decoder-only transformer models.
- Encoder-decoder: Some types of encoder-decoder models are supported.
- Encoder-only: Some types of encoder-only models are supported.
Users can serve a model with Inferflow by editing a model specification file. We have built predefined specification files for some popular or representative models. Below is a list of such models.
- Aquila (aquila_chat2_34b)
- Baichuan (baichuan2_7b_chat, baichuan2_13b_chat)
- BERT (bert-base-multilingual-cased)
- Bloom (bloomz_3b)
- Facebook m2m100 (facebook_m2m100_418m)
- Falcon (falcon_7b_instruct, falcon_40b_instruct)
- Internlm (internlm-chat-20b)
- LLAMA2 (llama2_7b, llama2_7b_chat, llama2_13b_chat)
- Mistral (mistral_7b_instruct)
- Open LLAMA (open_llama_3b)
- Phi-2 (phi_2)
- YI (yi_6b, yi_34b_chat)
git clone https://github.com/inferflow/inferflow
cd inferflow
Build with CMake on Linux, macOS, and WSL (Windows Subsystem for Linux):
Build the GPU version (that supports GPU/CPU hybrid inference):
mkdir build/gpu
cd build/gpu
cmake ../.. -DUSE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=75
make install -j 8
Build the CPU-only version:
mkdir build/cpu
cd build/cpu
cmake ../.. -DUSE_CUDA=0
make install -j 8
Upon a successful build, executables are generated and copied to bin/release/.
Build with Visual Studio on Windows:
Open one of the following sln files in build/vs_projects:
- inferflow.sln: The GPU version that supports GPU/CPU hybrid inference
- inferflow_cpu.sln: The CPU-only version
For the Debug configuration, executables are generated to bin/x64_Debug/, while the output directory for the Release configuration is bin/x64_Release/.
Example-1: Load a tiny model and perform inference
Step-1: Download the model
cd {inferflow-root-dir}/data/models/llama2.c/
bash download.sh
Step-2: Run the llm_inference tool:
cd {inferflow-root-dir}/bin/
release/llm_inference llm_inference.tiny.ini
Example-2: Run the llm_inference tool to load a larger model for inference
Step-1: Edit configuration file bin/llm_inference.ini to choose a model
Step-2: Download the selected model
cd {inferflow-root-dir}/data/models/{model-name}/
bash download.sh
Step-3: Run the tool:
cd {inferflow-root-dir}/bin/
release/llm_inference
Start the Inferflow service:
Step-1: Edit the service configuration file (bin/inferflow_service.ini)
Step-2: Start the service:
cd bin/release (on Windows: cd bin/x64_Release)
./inferflow_service
Run the Inferflow client (to interact with the Inferflow service via HTTP and get inference results):
Step-1: Edit the configuration file (bin/inferflow_client.ini)
Step-2: Run the client tool:
cd bin/release (on Windows: cd bin/x64_Release)
./inferflow_client
Inferflow is inspired by the awesome projects llama.cpp and llama2.c. The CPU inference part of Inferflow is based on the ggml library. The FP16 data type in the CPU-only version of Inferflow is from the Half-precision floating-point library. We express our sincere gratitude to the maintainers and implementers of these projects and tools.