Intel® Extension for Transformers

An innovative toolkit to accelerate Transformer-based models on Intel platforms

Architecture | NeuralChat | Examples | Documentation

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms. It is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:

Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers Fast Distilbert on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, and Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
NeuralChat, a custom chatbot trained on Intel CPUs through parameter-efficient fine-tuning (PEFT) on domain knowledge

Installation

Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, please refer to the Installation Page.
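After running the pip command, it can be useful to confirm that the package is visible to the current Python environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of the toolkit.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The distribution name matches the one used in the pip command above.
    version = installed_version("intel-extension-for-transformers")
    if version is None:
        print("intel-extension-for-transformers is not installed")
    else:
        print(f"intel-extension-for-transformers {version} is installed")
```

Querying the package metadata avoids importing the toolkit itself, so the check stays fast and does not pull in heavy dependencies.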

Getting Started

Sentiment Analysis with Quantization

Prepare Dataset

from datasets import load_dataset
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Load the SST-2 sentiment dataset from the GLUE benchmark
raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# Tokenize every sentence, truncating or padding it to exactly 128 tokens
raw_datasets = raw_datasets.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
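To make the effect of `truncation=True`, `padding='max_length'`, and `max_length` more concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the helper `pad_or_truncate`, the pad id of 0, and the example token ids are assumptions for illustration, and a real tokenizer also preserves special tokens when it truncates.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Toy model of truncation=True combined with padding='max_length'."""
    kept = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)     # padding: number of pad ids to append
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate(list(range(1, 9)), max_length=5)
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

Either way, every example ends up with the same length, which is what allows the mapped dataset to be batched into fixed-shape tensors.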

Quantization

from intel_extension_for_transformers.optimization import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", config=config)
model.config.label2id = {0: 0, 1: 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(model=model, 
    train_dataset=raw_datasets["train"], 
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer
)
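# Tune against the evaluation loss, where lower is better
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)

# Run the quantized model on a sample sentence; the result is the predicted class id
inputs = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**inputs).logits.argmax().item()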
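For readers new to quantization, the sketch below illustrates the general idea of mapping floating-point values to 8-bit integers and back. It is a simplified, symmetric per-tensor scheme written in plain Python for illustration only; the function names are hypothetical, and the scheme the toolkit actually applies may differ (for example in calibration, granularity, or zero-point handling).

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 if every value is zero
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per value is at most half a quantization step
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The small reconstruction error is the price paid for storing and computing with 8-bit integers, which is why the quantization step above is tuned against a metric such as `eval_loss`.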

For more quick samples, please refer to the Get Started Page. For more validated examples, please refer to the Support Model Matrix.

Documentation

OVERVIEW
Model Compression	NeuralChat	Neural Engine	Kernel Libraries
MODEL COMPRESSION
Quantization	Pruning	Distillation	Orchestration
Neural Architecture Search	Export	Metrics/Objectives	Pipeline
NEURAL ENGINE
Model Compilation	Custom Pattern	Deployment	Profiling
KERNEL LIBRARIES
Sparse GEMM Kernels	Custom INT8 Kernels	Profiling	Benchmark
ALGORITHMS
Length Adaptive		Data Augmentation
TUTORIALS AND RESULTS
Tutorials	Supported Models	Model Performance	Kernel Performance

Selected Publications/Events

Blog published on Medium: Create Your Own Custom Chatbot (April 2023)
Blog of Tech-Innovation Artificial-Intelligence(AI): Intel® Xeon® Processors Are Still the Only CPU With MLPerf Results, Raising the Bar By 5x - Intel Communities (April 2023)
Blog published on Medium: MLefficiency — Optimizing transformer models for efficiency (Dec 2022)
NeurIPS'2022: Fast Distilbert on CPUs (Nov 2022)
NeurIPS'2022: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM (Nov 2022)
Blog published by Cohere: Top NLP Papers—November 2022 (Nov 2022)
Blog published by Alibaba: Deep learning inference optimization for Address Purification (Aug 2022)
NeurIPS'2021: Prune Once for All: Sparse Pre-Trained Language Models (Nov 2021)
