Inferflow is an efficient and highly configurable inference engine and service for large language models (LLMs). With Inferflow, users can serve most of the common transformer models by simply modifying some lines in corresponding configuration files, without writing a single line of source code.
- Extensible and highly configurable: The typical way to serve a new model with Inferflow is to edit a model specification file, not to add or edit source code. Inferflow implements a modular framework of atomic building blocks and technologies, making it compositionally generalizable to new models. A new model can be served by Inferflow if its atomic building blocks and technologies are already "known" to Inferflow.
- 3.5-bit quantization: Inferflow implements 2-bit, 3-bit, 3.5-bit, 4-bit, 5-bit, 6-bit, and 8-bit quantization. Among these schemes, 3.5-bit quantization is newly introduced by Inferflow.
- Hybrid model partitioning for multi-GPU inference: Inferflow supports multi-GPU inference with three model partitioning strategies to choose from: partition-by-layer (pipeline parallelism), partition-by-tensor (tensor parallelism), and hybrid partitioning (hybrid parallelism). Hybrid partitioning is seldom supported by other inference engines.
- Wide file format support (and safe loading of pickle data): Inferflow supports loading models in multiple file formats directly, without relying on an external converter. Supported formats include pickle, safetensors, llama.cpp gguf, etc. Reading pickle files with Python code is known to pose security risks. By implementing a simplified pickle parser in C++, Inferflow loads models from pickle data safely (see the sketch after this list).
- Wide network type support: Three types of transformer models are supported: decoder-only models, encoder-only models, and encoder-decoder models.
- GPU/CPU hybrid inference: GPU-only, CPU-only, and GPU/CPU hybrid inference are all supported.
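As a concrete illustration of the pickle-safety point above: Python's pickle.load() is unsafe on untrusted files because opcodes such as GLOBAL and REDUCE make the unpickler import modules and call arbitrary callables. A native parser can instead just decode the opcode stream and record what it sees. The sketch below shows this idea for a handful of opcodes; it is a minimal illustration only, not Inferflow's actual parser, and a real loader must handle many more opcodes and perform stricter bounds checking.

```cpp
// Minimal sketch of scanning a pickle opcode stream in C++ without executing
// anything. Illustration only; NOT Inferflow's actual parser.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

bool ScanPickle(const std::vector<uint8_t> &buf) {
    size_t pos = 0;
    // Read a little-endian uint32 argument.
    auto read_u32 = [&](uint32_t &v) {
        if (pos + 4 > buf.size()) return false;
        v = buf[pos] | (buf[pos + 1] << 8) | (buf[pos + 2] << 16)
            | ((uint32_t)buf[pos + 3] << 24);
        pos += 4;
        return true;
    };
    // Read a newline-terminated text argument (used by GLOBAL).
    auto read_line = [&](std::string &s) {
        s.clear();
        while (pos < buf.size() && buf[pos] != '\n') s.push_back((char)buf[pos++]);
        if (pos >= buf.size()) return false;
        ++pos;  // skip '\n'
        return true;
    };

    while (pos < buf.size()) {
        uint8_t op = buf[pos++];
        switch (op) {
            case 0x80: ++pos; break;                        // PROTO <version>
            case '(': case ')': case ']': case '}': break;  // MARK, empty containers
            case 'N': break;                                // NONE
            case 'q': ++pos; break;                         // BINPUT <1-byte memo id>
            case 'r': pos += 4; break;                      // LONG_BINPUT <4-byte memo id>
            case 'K': ++pos; break;                         // BININT1
            case 'M': pos += 2; break;                      // BININT2
            case 'J': pos += 4; break;                      // BININT
            case 'X': {                                     // BINUNICODE <u32 len><utf-8>
                uint32_t len = 0;
                if (!read_u32(len) || pos + len > buf.size()) return false;
                printf("string: %.*s\n", (int)len, (const char *)buf.data() + pos);
                pos += len;
                break;
            }
            case 'c': {                                     // GLOBAL <module>\n<name>\n
                std::string module, name;
                if (!read_line(module) || !read_line(name)) return false;
                // Python's unpickler would import this and later call it via
                // REDUCE; a native parser only records the reference.
                printf("global: %s %s\n", module.c_str(), name.c_str());
                break;
            }
            case 'R': case 'Q': break;                      // REDUCE / BINPERSID: nothing is called
            case '.': return true;                          // STOP
            default:
                fprintf(stderr, "unhandled opcode 0x%02x\n", op);
                return false;                               // a real parser supports many more
        }
    }
    return false;  // ran out of data before STOP
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file.pkl>\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (fp == nullptr) { perror("fopen"); return 1; }
    std::vector<uint8_t> buf;
    uint8_t chunk[4096];
    size_t n = 0;
    while ((n = fread(chunk, 1, sizeof(chunk), fp)) > 0) buf.insert(buf.end(), chunk, chunk + n);
    fclose(fp);
    return ScanPickle(buf) ? 0 : 2;
}
```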
Below is a comparison between Inferflow and some other inference engines:
Engine | New Model Support | Supported File Formats | Network Structures | Quantization Bits | Hybrid Parallelism for Multi-GPU Inference | Programming Languages |
---|---|---|---|---|---|---|
Huggingface Transformers | Adding/editing source code | pickle (unsafe), safetensors | decoder-only, encoder-decoder, encoder-only | 4b, 8b | ✘ | Python |
vLLM | Adding/editing source code | pickle (unsafe), safetensors | decoder-only | 4b, 8b | ✘ | Python |
TensorRT-LLM | Adding/editing source code | | decoder-only, encoder-decoder, encoder-only | 4b, 8b | ✘ | C++, Python |
DeepSpeed-MII | Adding/editing source code | pickle (unsafe), safetensors | decoder-only | - | ✘ | Python |
llama.cpp | Adding/editing source code | gguf | decoder-only | 2b, 3b, 4b, 5b, 6b, 8b | ✘ | C/C++ |
llama2.c | Adding/editing source code | llama2.c | decoder-only | - | ✘ | C |
LMDeploy | Adding/editing source code | pickle (unsafe), TurboMind | decoder-only | 4b, 8b | ✘ | C++, Python |
Inferflow | Editing configuration files | pickle (safe), safetensors, gguf, llama2.c | decoder-only, encoder-decoder, encoder-only | 2b, 3b, 3.5b, 4b, 5b, 6b, 8b | ✔ | C++ |
Supported file formats:
- Pickle (Inferflow reduces the security risk that most other inference engines face when loading pickle-format files)
- Safetensors
- llama.cpp gguf
- llama2.c
Supported modules and technologies related to model definition:
- Normalization functions: STD, RMS (a reference sketch of RMS normalization follows this list)
- Activation functions: RELU, GELU, SILU
- Position embeddings: ALIBI, RoPE, Sinusoidal
- Grouped-query attention
- Parallel attention
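For reference, the RMS variant of normalization listed above can be written at the formula level as y_i = x_i * g_i / sqrt(mean(x^2) + eps). The short function below is only a plain restatement of that formula for clarity, not Inferflow's kernel; the epsilon value and the weight name gamma are illustrative assumptions.

```cpp
// RMS normalization, written as a formula-level reference, not Inferflow's code.
#include <cmath>
#include <vector>

void RmsNorm(const std::vector<float> &x, const std::vector<float> &gamma,
             std::vector<float> &y, float eps = 1e-5f) {
    // gamma is the learned per-element scale and must have the same size as x.
    float mean_sq = 0.0f;
    for (float v : x) mean_sq += v * v;
    mean_sq /= (float)x.size();
    float inv_rms = 1.0f / std::sqrt(mean_sq + eps);
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * inv_rms * gamma[i];
    }
}
```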
Supported technologies and options related to serving:
- Linear quantization of weights and KV cache elements: 2-bit, 3-bit, 3.5-bit, 4-bit, 5-bit, 6-bit, and 8-bit (a linear-quantization sketch follows this list)
- The option of moving part or all of the KV cache from VRAM to regular RAM
- The option of placing the input embedding tensor(s) in regular RAM
- Model partitioning strategies for multi-GPU inference: partition-by-layer, partition-by-tensor, hybrid partitioning
- Dynamic batching
- Decoding strategies: Greedy, top-k, top-p, FSD, typical, mirostat, etc. (a top-p sampling sketch follows this list)
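To make the linear-quantization bullet above more concrete, the sketch below quantizes one block of weights to n evenly spaced levels with a per-block scale and minimum, and dequantizes as code * scale + min. This is a generic illustration, not Inferflow's actual kernels, block sizes, or packed storage layout. The 3.5-bit packing helper at the end (two codes sharing 7 bits, e.g. 11 levels each since 11 * 11 = 121 <= 128) is likewise an illustrative assumption about how a fractional bit width can be realized, not a description of Inferflow's exact encoding.

```cpp
// Generic linear (affine) block quantization: an illustration only, not
// Inferflow's implementation or storage format.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantBlock {
    float scale = 1.0f;          // step between adjacent quantization levels
    float min = 0.0f;            // value represented by code 0
    std::vector<uint8_t> codes;  // one code per weight (bit packing omitted)
};

// Quantize a block of weights to `levels` evenly spaced values.
QuantBlock QuantizeLinear(const std::vector<float> &w, int levels) {
    QuantBlock blk;
    float lo = w[0], hi = w[0];
    for (float x : w) { lo = std::min(lo, x); hi = std::max(hi, x); }
    blk.min = lo;
    blk.scale = (hi > lo) ? (hi - lo) / (float)(levels - 1) : 1.0f;
    blk.codes.reserve(w.size());
    for (float x : w) {
        int q = (int)std::lround((x - lo) / blk.scale);
        q = std::max(0, std::min(q, levels - 1));
        blk.codes.push_back((uint8_t)q);
    }
    return blk;
}

// Dequantize one element back to float.
float Dequantize(const QuantBlock &blk, size_t i) {
    return blk.codes[i] * blk.scale + blk.min;
}

// Hypothetical 3.5-bit packing for illustration: with 11 levels per weight,
// two codes fit into one 7-bit value because 11 * 11 = 121 <= 128.
uint8_t PackTwoCodes11(uint8_t a, uint8_t b) {
    return (uint8_t)(a * 11 + b);  // decode: a = packed / 11, b = packed % 11
}
```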
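And for the decoding-strategies bullet, here is a minimal top-p (nucleus) sampler over an already-normalized probability vector. It only illustrates what that strategy does; it is not Inferflow's sampler, and the other listed strategies (greedy, top-k, FSD, typical, mirostat) work differently.

```cpp
// Minimal top-p (nucleus) sampling over a probability distribution.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int SampleTopP(const std::vector<float> &probs, float top_p, std::mt19937 &rng) {
    // Sort token ids by probability, descending.
    std::vector<int> ids(probs.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::sort(ids.begin(), ids.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    // Keep the smallest prefix whose cumulative probability reaches top_p.
    float cum = 0.0f;
    size_t cutoff = ids.size();
    for (size_t i = 0; i < ids.size(); ++i) {
        cum += probs[ids[i]];
        if (cum >= top_p) { cutoff = i + 1; break; }
    }

    // Renormalize within the nucleus and draw one token.
    std::uniform_real_distribution<float> dist(0.0f, cum);
    float r = dist(rng), acc = 0.0f;
    for (size_t i = 0; i < cutoff; ++i) {
        acc += probs[ids[i]];
        if (r <= acc) return ids[i];
    }
    return ids[cutoff - 1];
}
```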
Supported network structures:
- Decoder-only: Inferflow supports many types of decoder-only transformer models.
- Encoder-decoder: Some types of encoder-decoder models are supported.
- Encoder-only: Some types of encoder-only models are supported.
Users can serve a model with Inferflow by editing a model specification file. We have built predefined specification files for some popular or representative models. Below is a list of such models.
- Aquila (aquila_chat2_34b)
- Baichuan (baichuan2_7b_chat, baichuan2_13b_chat)
- BERT (bert-base-multilingual-cased)
- Bloom (bloomz_3b)
- Facebook m2m100 (facebook_m2m100_418m)
- Falcon (falcon_7b_instruct, falcon_40b_instruct)
- Internlm (internlm-chat-20b)
- LLAMA2 (llama2_7b, llama2_7b_chat, llama2_13b_chat)
- Mistral (mistral_7b_instruct)
- Open LLAMA (open_llama_3b)
- Phi-2 (phi_2)
- YI (yi_6b, yi_34b_chat)
git clone https://github.com/inferflow/inferflow
cd inferflow
Build with CMake on Linux, macOS, and WSL (Windows Subsystem for Linux):
Build the GPU version (that supports GPU/CPU hybrid inference):
mkdir build/gpu
cd build/gpu
cmake ../.. -DUSE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=75
make install -j 8
Build the CPU-only version:
mkdir build/cpu
cd build/cpu
cmake ../.. -DUSE_CUDA=0
make install -j 8
Upon a successful build, executables are generated and copied to bin/release/.
Build with Visual Studio on Windows:
Open one of the following sln files in build/vs_projects:
- inferflow.sln: The GPU version that supports GPU/CPU hybrid inference
- inferflow_cpu.sln: The CPU-only version
For the Debug configuration, executables are generated to bin/x64_Debug/, while the output directory for the Release configuration is bin/x64_Release/.
Example-1: Load a tiny model and perform inference
Step-1: Download the model
cd {inferflow-root-dir}/data/models/llama2.c/
bash download.sh
Step-2: Run the llm_inference tool:
cd {inferflow-root-dir}/bin/
release/llm_inference llm_inference.tiny.ini
Example-2: Run the llm_inference tool to load a larger model for inference
Step-1: Edit configuration file bin/llm_inference.ini to choose a model
Step-2: Download the selected model
cd {inferflow-root-dir}/data/models/{model-name}/
bash download.sh
Step-3: Run the tool:
cd {inferflow-root-dir}/bin/
release/llm_inference
Start the Inferflow service:
Step-1: Edit the service configuration file (bin/inferflow_service.ini)
Step-2: Start the service:
cd bin/release (on Windows: cd bin/x64_Release)
./inferflow_service
Run the Inferflow client (to interact with the Inferflow service via HTTP and get inference results):
Step-1: Edit the configuration file (bin/inferflow_client.ini)
Step-2: Run the client tool:
cd bin/release (on Windows: cd bin/x64_Release)
./inferflow_client
Inferflow is inspired by the awesome projects llama.cpp and llama2.c. The CPU inference part of Inferflow is based on the ggml library. The FP16 data type in the CPU-only version of Inferflow is from the Half-precision floating-point library. We express our sincere gratitude to the maintainers and implementers of these projects and tools.