
Extends the Hugging Face transformers APIs for Transformer-based models and improves the productivity of inference deployment. With extremely compressed models, the toolkit can greatly improve inference efficiency on Intel platforms.

Primary language: C++. License: Apache-2.0.

Intel® Extension for Transformers: Accelerating Transformer-based Models on Intel Platforms

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms. It is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the key features and examples described in the sections below.

Installation

Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, refer to the Installation Page.
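After installing, it can be handy to confirm that the package is visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this illustration and is not part of the toolkit.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import the package itself.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the pip install above succeeded, otherwise None.
print(installed_version("intel-extension-for-transformers"))
```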

Getting Started

Sentiment Analysis with Quantization

Prepare Dataset

from datasets import load_dataset
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Load the SST-2 sentiment task from GLUE and the matching tokenizer
raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# Tokenize every sentence to a fixed length of 128 tokens
raw_datasets = raw_datasets.map(
    lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)
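To make the effect of truncation=True, padding='max_length', max_length=128 concrete, here is a minimal pure-Python sketch of what happens to one sequence of token ids. It is an illustration only, not the tokenizer's implementation: the function name pad_or_truncate and the sample ids are assumptions, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly `max_length` long.

    Illustrative only: sequences longer than max_length are cut from the
    right, shorter ones are padded on the right with `pad_id`.
    """
    ids = list(token_ids)[:max_length]      # truncation
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # padding, 0 marks pad


# Illustrative ids, shortened max_length so the result is easy to read.
ids, mask = pad_or_truncate([101, 1045, 2066, 102], max_length=8)
print(ids)   # [101, 1045, 2066, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Padding every example to the same length keeps batches rectangular, at the cost of some wasted computation on pad positions.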

Quantization

from intel_extension_for_transformers.optimization import QuantizationConfig, metrics
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", config=config)
model.config.label2id = {0: 0, 1: 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(
    model=model,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer,
)
# Tune quantization against evaluation loss, where lower is better
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)

# Run the quantized model on one sentence and take the highest-scoring class id
inputs = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**inputs).logits.argmax().item()
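The value of output above is a class id (0 or 1). A minimal sketch of turning such an id into a readable label, using the same id2label mapping set on the model config, might look like the following. The helper predict_label and the logit values are hypothetical and chosen only for illustration.

```python
def predict_label(logits, id2label):
    """Return the label whose logit is largest (a plain-Python argmax)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


# Same mapping as model.config.id2label in the quantization example.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits for one sentence: index 1 has the larger score.
print(predict_label([-2.1, 3.4], id2label))  # POSITIVE
```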

For more quick samples, refer to the Get Started Page. For more validated examples, refer to the Support Model Matrix.

Documentation

OVERVIEW
Model Compression | Neural Engine | Kernel Libraries | Examples
MODEL COMPRESSION
Quantization | Pruning | Distillation | Orchestration
Neural Architecture Search | Export | Metrics/Objectives | Pipeline
NEURAL ENGINE
Model Compilation | Custom Pattern | Deployment | Profiling
KERNEL LIBRARIES
Sparse GEMM Kernels | Custom INT8 Kernels | Profiling | Benchmark
ALGORITHMS
Length Adaptive | Data Augmentation
TUTORIALS AND RESULTS
Tutorials | Supported Models | Model Performance | Kernel Performance

Selected Publications/Events