In this session, you will learn how to optimize Hugging Face Transformers models using Optimum. The session will show you how to dynamically quantize and optimize a DistilBERT model using Hugging Face Optimum and ONNX Runtime. Hugging Face Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware.
Note: dynamic quantization is currently only supported for CPUs, so we will not be utilizing GPUs / CUDA in this session.
By the end of this session, you will see how quantization and optimization with Hugging Face Optimum can significantly reduce model latency while retaining almost 100% of the full-precision model's accuracy. Furthermore, you'll see how to easily apply some of the advanced quantization and optimization techniques shown here so that your models take a much smaller accuracy hit than they would otherwise.
You will learn how to:
- Set up the development environment
- Convert a Hugging Face Transformers model to ONNX for inference
- Apply graph optimization techniques to the ONNX model
- Apply dynamic quantization using ORTQuantizer from 🤗 Optimum
- Test inference with the quantized model
- Evaluate the performance and speed
- Push the quantized model to the Hub
- Load and run inference with a quantized model from the Hub
Let's get started! 🚀
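As a preview of the core workflow (ONNX export, graph optimization, and dynamic quantization), here is a minimal sketch. It assumes `optimum[onnxruntime]` is installed and uses `distilbert-base-uncased-finetuned-sst-2-english` purely as an example checkpoint; exact class and argument names can differ between Optimum releases, so check the documentation for your version.

```python
from optimum.onnxruntime import (
    ORTModelForSequenceClassification,
    ORTOptimizer,
    ORTQuantizer,
)
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

# Example checkpoint (assumption): any fine-tuned DistilBERT classifier works the same way.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
onnx_path = "onnx"

# 1. Export the Transformers model to ONNX.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
model.save_pretrained(onnx_path)

# 2. Apply graph optimization techniques to the ONNX model.
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=99)  # enable all available fusions
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

# 3. Apply dynamic quantization with ORTQuantizer (CPU-only, as noted above).
quantizer = ORTQuantizer.from_pretrained(model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir=onnx_path, quantization_config=dqconfig)
```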
Miniconda or Micromamba setup (lightweight alternatives to a full Anaconda installation)
Miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
pip install -r requirements.txt
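The contents of `requirements.txt` are not listed in this section; a plausible minimal set for this tutorial (an assumption; pin versions to match your environment) would be:

```
optimum[onnxruntime]
evaluate
datasets
```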
This is a template repository for text classification using Optimum and ONNX Runtime, to support generic inference with the Hugging Face Hub generic Inference API. There are two required steps:
- Specify the requirements by defining a `requirements.txt` file.
- Implement the `pipeline.py` `__init__` and `__call__` methods. These methods are called by the Inference API. The `__init__` method should load the model and preload the Optimum model and tokenizer, as well as the `text-classification` pipeline needed for inference. This is only called once. The `__call__` method performs the actual inference. Make sure to follow the same input/output specifications defined in the template for the pipeline to work. A minimal sketch is shown after this list.
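A minimal sketch of `pipeline.py` could look like the following. The `PreTrainedPipeline` class name, the quantized model file name, and the input/output types are assumptions here; follow whatever the template itself specifies.

```python
from typing import Any, Dict, List

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline


class PreTrainedPipeline:
    def __init__(self, path: str = ""):
        # Called once at startup: load the quantized ONNX model and tokenizer,
        # then build the text-classification pipeline used for inference.
        # The file name is an assumption; use the name of the quantized model in your repository.
        model = ORTModelForSequenceClassification.from_pretrained(
            path, file_name="model_quantized.onnx"
        )
        tokenizer = AutoTokenizer.from_pretrained(path)
        self.pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)

    def __call__(self, inputs: str) -> List[Dict[str, Any]]:
        # Called for every request: run the actual inference on the "inputs" payload.
        return self.pipeline(inputs)
```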
Add `library_name: generic` to the README.
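In the README, `library_name` belongs in the YAML metadata block at the top of the file, for example (the `tags` entry is only an illustration):

```yaml
---
library_name: generic
tags:
  - text-classification
---
```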
Note: the generic community image currently only supports `inputs` as a parameter; no additional parameters are supported.
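For illustration, a request to such an endpoint therefore carries only an `inputs` field. The URL and token below are placeholders, not values from this tutorial:

```python
import requests

# Placeholder values: substitute your own repository id and access token.
API_URL = "https://api-inference.huggingface.co/models/<your-username>/<your-repo>"
headers = {"Authorization": "Bearer <your_hf_token>"}

# Only "inputs" is supported by the generic community image; no extra parameters.
response = requests.post(API_URL, headers=headers, json={"inputs": "I love this movie!"})
print(response.json())
```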