Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization and sparsity, powered by Intel Neural Compressor and llama.cpp. Highlights of this project:

Support LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, MagicCoder and StarCoder
Highly optimized low-precision kernels utilizing the AMX, VNNI, AVX512F, AVX_VNNI and AVX2 instruction sets
Up to 40x faster than llama.cpp; performance details: blog
NeurIPS' 2023: Efficient LLM Inference on CPUs
Support 4-bit and 8-bit quantization
Tensor Parallelism across sockets/nodes: tensor_parallelism.md

Neural Speed is under active development, so APIs are subject to change.

Installation

Build Python package (Recommended way)

pip install -r requirements.txt
pip install .

Note: Please make sure your GCC version is higher than 10.

Quick Start

There are two approaches to using Neural Speed: 1. Transformer-like usage, which requires installing ITREX (Intel Extension for Transformers) 2. llama.cpp-like usage

1. Transformer-like usage

PyTorch format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

GGUF format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Please refer to this link to check the supported models.

To use the Transformer-based API in ITREX (Intel Extension for Transformers), please refer to the ITREX Installation Page.

2. llama.cpp-like usage

One-click Python scripts

Run an LLM with a one-click Python script that covers conversion, quantization and inference.

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"

Quantize and Inference Step By Step

Neural Speed supports:
1. GGUF models generated by llama.cpp
2. GGUF models from Hugging Face
3. PyTorch models from Hugging Face, quantized by Neural Speed

Neural Speed offers scripts to 1) convert and quantize and 2) run inference, for converting the model yourself. If the GGUF model is from Hugging Face or was generated by llama.cpp, you can run inference on it directly.
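As a rough illustration of the distinction above, the sketch below decides whether a model can be passed straight to inference or should go through the convert-and-quantize step first. The helper name `needs_conversion` and the reliance on the `.gguf` file extension are assumptions made for this example; they are not part of Neural Speed.

```python
from pathlib import Path


def needs_conversion(model_path: str) -> bool:
    """Return True if the model should be converted and quantized first.

    Hypothetical helper: any file ending in ".gguf" is treated as ready for
    inference; everything else (for example a Hugging Face model ID or a
    PyTorch checkpoint directory) is treated as requiring conversion.
    """
    return Path(model_path).suffix.lower() != ".gguf"


if __name__ == "__main__":
    for candidate in ("llama-2-7b-chat.Q4_0.gguf", "EleutherAI/gpt-j-6b"):
        if needs_conversion(candidate):
            action = "convert and quantize first"
        else:
            action = "run inference directly"
        print(f"{candidate}: {action}")
```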

1. Convert and Quantize LLM

Convert the model by following the steps below:

# Convert the model directly using its Hugging Face model ID (recommended)
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b

2. Inference

Linux and WSL

OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"
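To make the placeholders in the Linux command concrete, here is a minimal sketch that fills them in for a given number of physical cores. The function name `build_linux_inference_command` is hypothetical; the flags and default values are copied from the command above. Note that `os.cpu_count()` reports logical CPUs, so the physical core count has to be supplied by the caller.

```python
import shlex


def build_linux_inference_command(
    physical_cores: int,
    model_file: str = "ne-q4_j.bin",
    model_name: str = "llama",
    prompt: str = "She opened the door and see",
) -> str:
    """Compose the Linux/WSL inference command line for N physical cores."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    argv = [
        # Bind memory to NUMA node 0 and pin to cores 0..N-1.
        "numactl", "-m", "0", "-C", f"0-{physical_cores - 1}",
        "python", "scripts/inference.py",
        "--model_name", model_name,
        "-m", model_file,
        "-c", "512", "-b", "1024", "-n", "256",
        "-t", str(physical_cores),
        "--color",
        "-p", prompt,
    ]
    # shlex.join quotes the prompt so the result is safe to paste into a shell.
    return f"OMP_NUM_THREADS={physical_cores} " + shlex.join(argv)


if __name__ == "__main__":
    print(build_linux_inference_command(8))
```

With 8 physical cores, the thread count becomes 8 and the pinned core range becomes `0-7`, which matches the `0-<physic_cores-1>` pattern in the command.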

Windows

python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores|P-cores> --color -p "She opened the door and see"

For details please refer to Advanced Usage.

Supported Hardware

Hardware	Optimization
Intel Xeon Scalable Processors	✔
Intel Xeon CPU Max Series	✔
Intel Core Processors	✔

Supported Models

LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, MagicCoder and StarCoder. You can find more details, such as validated GGUF models from Hugging Face, in the list.

Neural Speed also supports GGUF models generated by llama.cpp; you need to download the model and use llama.cpp to create the GGUF file. Validated models: llama2-7b-chat-hf, falcon-7b, falcon-40b, mpt-7b, mpt-40b and bloom-7b1.

Advanced Usage

1. Quantization and inference

More parameters in llama.cpp-like usage: Advanced Usage.

2. Tensor Parallelism across nodes/sockets

We support a tensor parallelism strategy for distributed inference/training across multiple nodes and sockets. Refer to tensor_parallelism.md to enable this feature.

3. Custom Stopping Criteria

You can customize the stopping criteria to your own needs by processing input_ids to determine whether text generation should stop. See the Custom Stopping Criteria document: simple example with a minimum generation length of 80 tokens
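The idea can be sketched in plain Python as follows. This is a minimal illustration, not the code from the linked document: the class name `MinLengthStopOnTokens` and its parameters are assumptions, and the call shape `(input_ids, scores, **kwargs)` mirrors Hugging Face's `StoppingCriteria.__call__`. How such an object is passed to `generate` is covered by the Custom Stopping Criteria document.

```python
from typing import Iterable


class MinLengthStopOnTokens:
    """Stop on a stop token, but only after a minimum number of new tokens.

    Hypothetical sketch: input_ids is assumed to be a batch of token ID
    sequences (for example a nested list or a 2-D tensor) with batch size 1.
    """

    def __init__(self, min_length: int, start_length: int, stop_token_ids: Iterable[int]):
        self.min_length = min_length        # minimum number of generated tokens
        self.start_length = start_length    # prompt length in tokens
        self.stop_token_ids = set(stop_token_ids)

    def __call__(self, input_ids, scores=None, **kwargs) -> bool:
        sequence = input_ids[0]             # the only sequence in the batch
        generated = len(sequence) - self.start_length
        if generated < self.min_length:
            return False                    # too short: keep generating
        return int(sequence[-1]) in self.stop_token_ids


if __name__ == "__main__":
    # Prompt of 3 tokens, at least 80 new tokens, stop on token ID 2.
    criteria = MinLengthStopOnTokens(min_length=80, start_length=3, stop_token_ids=[2])
    short_output = [[5, 6, 7, 2]]                 # 1 new token, ends in stop token
    long_output = [[5, 6, 7] + [9] * 79 + [2]]    # 80 new tokens, ends in stop token
    print(criteria(short_output), criteria(long_output))
```

The stop token is ignored until the minimum length is reached, so an early end-of-sequence token does not cut generation short.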

4. Verbose Mode

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable.

Available modes:

0: Print all tracing information. Comprehensive output, including evaluation time and operator profiling (requires setting NS_PROFILING to ON and recompiling).
1: Print evaluation time. Reports the time taken for each evaluation.
2: Profile individual operators. Identifies performance bottlenecks within the model (requires setting NS_PROFILING to ON and recompiling).
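One way to set the variable from Python is to build an environment for a child process, as in the minimal sketch below. The helper name `verbose_env` and the `VERBOSE_MODES` table are assumptions for illustration; only the variable name NEURAL_SPEED_VERBOSE and the three mode values come from the list above.

```python
import os
from typing import Dict, Mapping, Optional

# Mode descriptions paraphrased from the list above.
VERBOSE_MODES = {
    0: "all tracing information (profiling needs an NS_PROFILING=ON build)",
    1: "evaluation time",
    2: "per-operator profiling (needs an NS_PROFILING=ON build)",
}


def verbose_env(mode: int, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with NEURAL_SPEED_VERBOSE set."""
    if mode not in VERBOSE_MODES:
        raise ValueError(f"mode must be one of {sorted(VERBOSE_MODES)}, got {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["NEURAL_SPEED_VERBOSE"] = str(mode)
    return env


if __name__ == "__main__":
    env = verbose_env(1)
    print("NEURAL_SPEED_VERBOSE =", env["NEURAL_SPEED_VERBOSE"], "->", VERBOSE_MODES[1])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the inference script, which leaves the parent process environment unchanged.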

Enable New Model

To add your own model, please follow the graph developer document.
