- NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers, is now available for you to create your own chatbot within minutes! It supports a rich set of plugins such as Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail, and runs on multiple architectures such as Intel® Xeon® Scalable Processors and the Habana Gaudi® Accelerator. Check out the sample code below and try it now!
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
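Plugins such as Knowledge Retrieval are switched on through a pipeline configuration. The snippet below is a minimal sketch, assuming the PipelineConfig and plugins objects exported by the neural_chat module and a retrieval plugin that takes a local document path; plugin names and argument keys may differ between releases.
# Minimal sketch: enable the Knowledge Retrieval plugin (API names assumed as noted above)
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot, plugins
plugins.retrieval.enable = True                  # turn on retrieval-augmented answers
plugins.retrieval.args["input_path"] = "./docs"  # placeholder path to your own documents
config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict("What do the attached documents say about Sapphire Rapids?")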
- 💬 NeuralChat v1.1, a chat model fine-tuned from MPT-7B on a mixed set of instruction datasets, is now available on Hugging Face, together with INT8 quantization recipes and benchmark results.
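The fine-tuned model can be loaded like any other Hugging Face causal language model. A minimal sketch, assuming the checkpoint is published as Intel/neural-chat-7b-v1-1 and, as an MPT derivative, needs trust_remote_code:
# Minimal sketch (assumes model id Intel/neural-chat-7b-v1-1 on Hugging Face)
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v1-1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v1-1", trust_remote_code=True)
inputs = tokenizer("Tell me about Intel Xeon Scalable Processors.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))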
pip install intel-extension-for-transformers
For more installation methods, please refer to the Installation Page.
Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on the 4th Gen Intel® Xeon® Scalable processor (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced software optimizations and a unique compression-aware runtime (released with NeurIPS 2022's papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and NeurIPS 2021's paper Prune Once for All: Sparse Pre-Trained Language Models)
- Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
- NeuralChat, a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins and SOTA optimizations
- Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels, supporting GPT-NEOX, LLAMA, MPT, FALCON, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B
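Weight-only quantization for LLM inference is exposed through a Transformers-style Python API on top of the C/C++ runtime. The sketch below is illustrative only, assuming the AutoModelForCausalLM wrapper under intel_extension_for_transformers.transformers and its load_in_4bit argument; consult the LLM runtime documentation for the exact options.
# Minimal sketch (assumes the extension's AutoModelForCausalLM wrapper and load_in_4bit flag)
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "EleutherAI/gpt-j-6b"   # any of the supported LLMs listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)  # weight-only 4-bit quantization at load time
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The example below applies INT8 quantization to a Hugging Face sequence-classification model by replacing the stock Trainer with NLPTrainer: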
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
# Load and tokenize the GLUE SST-2 dataset
raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_datasets = raw_datasets.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
from intel_extension_for_transformers.transformers import QuantizationConfig, metrics
from intel_extension_for_transformers.transformers.trainer import NLPTrainer
# Load the FP32 model to be quantized
config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", config=config)
model.config.label2id = {'NEGATIVE': 0, 'POSITIVE': 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(model=model,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer
)
# Tune INT8 quantization until eval_loss meets the accuracy criterion
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)
# Run inference with the quantized model
input = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**input).logits.argmax().item()
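The quantized model can be saved and reloaded for deployment. A minimal sketch, assuming trainer.save_model persists the INT8 weights and the OptimizedModel helper in intel_extension_for_transformers.transformers restores them; the output directory name is a placeholder.
# Minimal sketch (assumes the OptimizedModel helper; "./distilbert-sst2-int8" is a placeholder path)
from intel_extension_for_transformers.transformers import OptimizedModel
trainer.save_model("./distilbert-sst2-int8")
loaded_model = OptimizedModel.from_pretrained("./distilbert-sst2-int8")
sample = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
print(loaded_model(**sample).logits.argmax().item())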
For more quick-start samples, please refer to the Get Started Page. For more validated examples, please refer to the Support Model Matrix.
| Model | FP32 | BF16 | INT8 |
|---|---|---|---|
| EleutherAI/gpt-j-6B | 4163.67 (ms) | 1879.61 (ms) | 1612.24 (ms) |
| CompVis/stable-diffusion-v1-4 | 10.33 (s) | 3.02 (s) | N/A |
Note: For the GPT-J-6B software/hardware configuration, please refer to text-generation. For the Stable Diffusion software/hardware configuration, please refer to text-to-image.
| OVERVIEW | | | |
|---|---|---|---|
| Model Compression | NeuralChat | Neural Engine | Kernel Libraries |
| MODEL COMPRESSION | | | |
| Quantization | Pruning | Distillation | Orchestration |
| Neural Architecture Search | Export | Metrics/Objectives | Pipeline |
| NEURAL ENGINE | | | |
| Model Compilation | Custom Pattern | Deployment | Profiling |
| KERNEL LIBRARIES | | | |
| Sparse GEMM Kernels | Custom INT8 Kernels | Profiling | Benchmark |
| ALGORITHMS | | | |
| Length Adaptive | Data Augmentation | | |
| TUTORIALS AND RESULTS | | | |
| Tutorials | Supported Models | Model Performance | Kernel Performance |
- Keynote: Intel Innovation 2023 Livestream - Day2 (Sep 2023)
- Blog published on Medium: NeuralChat: A Customizable Chatbot Framework (Sep 2023)
- Blog published on Medium: Faster Stable Diffusion Inference with Intel Extension for Transformers (July 2023)
- Blog of Intel Developer News: The Moat Is Trust, Or Maybe Just Responsible AI (July 2023)
- Blog of Intel Developer News: Create Your Own Custom Chatbot (July 2023)
- Blog of Intel Developer News: Accelerate Llama 2 with Intel AI Hardware and Software Optimizations (July 2023)
- arXiv: An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (June 2023)
- Blog published on Medium: Simplify Your Custom Chatbot Deployment (June 2023)
View Full Publication List.
You are welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!