The benchmarking repository provides an easy and flexible testbed to generate, run, and save multiple configurations in order to compare Transformer-based neural network models.
The project leverages the Hydra framework from Facebook AI Research, which generates all the requested sweeps from configuration files. Currently, we provide benchmarks for five of the most widely used Deep Learning frameworks:
- PyTorch (Eager mode)
- TorchScript (Static Graph mode)
- TensorFlow 2 (Eager mode)
- TensorFlow 2 Graph (Static Graph mode)
- ONNX Runtime for Inference (Static Graph mode + Graph Optimizations)
The repository is divided into two principal sections:
- `config/` stores all the configuration files for the supported backends.
- `backends/` stores the actual logic to generate textual inputs and execute a forward pass for the targeted backend.
Instructions presented here have been tested on Ubuntu 20.04:

```shell
apt update && apt -y install python3 python3-pip python3-dev libnuma-dev
cd <repo/path>
pip install -r requirements.txt
```
Hydra, the configuration framework used in this project, provides a simple command-line interface to specify and override the configuration to be run.
For instance, in order to run a benchmark for ONNX Runtime on CPU with:
- Backend = ORT
- Model = bert-base-cased
- Device = CPU
- Batch Size = 1
- Sequence Length = 32
```shell
python3 src/main.py model=bert-base-cased sequence_length=32 backend=ort device=cpu
```
Hydra integrates a very powerful sweep-generation utility, exposed through the `--multirun` command-line flag when invoking the benchmark script.
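Under `--multirun`, comma-separated override values are expanded into the cross-product of all configurations, with one job launched per combination. A standard-library sketch of that expansion (the parameter names mirror the examples in this document; the expansion itself is performed by Hydra):

```python
from itertools import product

# Comma-separated Hydra overrides, as they would appear after --multirun.
sweep = {
    "backend.num_threads": [2, 4, 8],
    "backend.num_interops_threads": [2, 4, 8],
}

# --multirun launches one benchmark job per element of the cross-product.
jobs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(jobs))  # 3 x 3 = 9 independent benchmark runs
```

Each resulting dictionary corresponds to one complete configuration that Hydra runs to completion before moving on to the next.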
For instance, in order to run a benchmark for PyTorch on CPU with the following specs:
- Model = bert-base-cased
- Device = CPU
- Batch Size = 1
- Sequence Length = 128
```shell
python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=128 backend=pytorch device=cpu
```
- `backend`: Specify the backend(s) to use to run the benchmark: `{"pytorch", "torchscript", "tensorflow", "xla", "ort"}`
- `device`: Specify on which device to run the benchmark: `{"cpu", "cuda"}`
- `precision`: Specify the model's parameters data format. For now, only `float32` (i.e. full precision) is supported.
- `num_threads`: Number of threads to use for intra-operation parallelism (`-1` detects the number of CPU cores and uses that value).
- `num_interops_threads`: Number of threads to use for inter-operation parallelism (`-1` detects the number of CPU cores and uses that value).
- `warmup_runs`: Number of warmup forward passes to execute before recording any benchmarking results (especially useful to preallocate memory buffers).
- `benchmark_duration`: Duration (in seconds) of the benchmark; as many forward calls as possible are executed within the specified duration. These runs are executed after `warmup_runs`.
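The interaction between `warmup_runs` and `benchmark_duration` can be sketched as follows (a minimal standard-library illustration, not the repository's actual implementation; `forward` stands in for any model call):

```python
import time

def benchmark(forward, warmup_runs=5, benchmark_duration=2.0):
    """Warm up, then call `forward` as many times as possible within the duration."""
    for _ in range(warmup_runs):
        forward()  # warmup passes: results are discarded

    latencies = []
    start = time.perf_counter()
    while time.perf_counter() - start < benchmark_duration:
        t0 = time.perf_counter()
        forward()
        latencies.append(time.perf_counter() - t0)  # one entry per recorded call
    return latencies
```

The number of recorded calls thus depends on the per-call latency: faster configurations accumulate more samples within the same `benchmark_duration`.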
Each framework exposes different features that can be enabled to tune the execution of the model on the underlying hardware. In this repository we expose some of them, essentially the most common ones.
- `use_torchscript`: Boolean indicating whether the runtime should trace the eager model to produce an optimized version. This value is `False` when using backend `pytorch` and `True` when using backend `torchscript`.
- `use_xla`: Boolean indicating whether the model should be wrapped in `tf.function(jit_compile=True)` in order to compile the underlying graph through XLA. This value is `False` when using backend `tensorflow_graph` and can be enabled through the config file or the command line.
- `opset`: Integer setting which version of the ONNX opset specification to use when exporting the model.
- `graph_optimisation_level`: Which level of optimization to apply with ONNX Runtime when loading the model. Possible values are:
  - `ORT_DISABLE_ALL`: Use the raw ONNX graph without any further optimization.
  - `ORT_ENABLE_BASIC`: Use basic graph optimizations which are not platform dependent.
  - `ORT_ENABLE_EXTENDED`: Use more advanced techniques (might include platform-dependent optimizations).
  - `ORT_ENABLE_ALL`: Enable all possible optimizations (might include platform-dependent optimizations).
- `execution_mode`: Mode used to execute the ONNX graph. Can be either:
  - `ORT_SEQUENTIAL`: Execute the graph sequentially, without looking for subgraphs to execute in parallel.
  - `ORT_PARALLEL`: Execute the graph potentially in parallel, looking for non-dependent subgraphs which can be run simultaneously.
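These options map onto ONNX Runtime's `SessionOptions`. As a configuration sketch (not the repository's code; `"model.onnx"` is a placeholder path):

```python
import onnxruntime as ort

# Mirror the graph_optimisation_level / execution_mode / thread settings
# described above on a SessionOptions object.
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.intra_op_num_threads = 4   # corresponds to num_threads
options.inter_op_num_threads = 1   # corresponds to num_interops_threads

# The configured options are then passed when creating the inference session:
# session = ort.InferenceSession("model.onnx", sess_options=options)
```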
The benchmarking repository comes with a launcher tool highly inspired by the one made available by Intel. The launcher tool helps you handle all the low-level bits needed to configure experiments and get the best out of the platform you have.
More precisely, it will be able to configure the following elements:
- Linux transparent huge pages mechanism
- CPU cores affinity for OpenMP threads on NUMA platforms
- Memory affinity for OpenMP threads on NUMA platforms
- OpenMP configurations (KMP_AFFINITY, KMP_BLOCKTIME, OMP_NUM_THREADS, OMP_MAX_ACTIVE_LEVELS, etc.)
- The OpenMP library to use at runtime (GNU / Intel)
- The memory allocation library to use (std, tcmalloc, jemalloc)
- Multi-instance inference setup (multiple independent models executing in parallel) with per-instance CPU core/memory affinity
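As an illustration of what the launcher automates, part of the equivalent manual setup could look like the following (the values and paths are assumptions for a hypothetical 16-core NUMA machine, not recommendations):

```shell
# OpenMP configuration (normally handled by launcher.py)
export OMP_NUM_THREADS=16
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

# Optional: swap in tcmalloc as the memory allocator
# export LD_PRELOAD=</path/to/libtcmalloc.so>

# Pin the benchmark to the cores and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=32
```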
The launcher script `launcher.py` is located at the root of the transformers-benchmarks folder. You can run `python launcher.py --help` to see all the available tuning options.
Sweep examples with Hydra's `--multirun`:

```shell
python3 src/main.py --multirun model=bert-base-cased backend=pytorch,torchscript,tensorflow,xla,ort
python3 src/main.py --multirun model=bert-base-cased batch_size=1 sequence_length=32 backend.num_threads=2,4,8 backend.num_interops_threads=2,4,8
```
Tuning OpenMP thread affinity (`KMP_AFFINITY`):

```shell
python launcher.py --kmp_affinity=<value_here> -- src/main.py model=bert-base-cased batch_size=1 sequence_length=32 ...
```
Tuning the number of model instances (multi-instance setup) along with intra-/inter-op threads for parallel sections:

```shell
python launcher.py --ninstances=4 -- src/main.py model=bert-base-cased batch_size=1 sequence_length=32 ...
```
Using tcmalloc as the memory allocation library:

```shell
export TCMALLOC_LIBRARY_PATH=</path/to/tcmalloc/libtcmalloc.so>
python launcher.py --enable_tcmalloc -- src/main.py model=bert-base-cased batch_size=1 sequence_length=32 ...
```
Using the Intel OpenMP runtime:

```shell
export INTEL_OPENMP_LIBRARY_PATH=</path/to/intel/openmp/libomp.so>
python launcher.py --enable_iomp -- src/main.py model=bert-base-cased batch_size=1 sequence_length=32 ...
```
Enabling Linux transparent huge pages:

```shell
python launcher.py --enable_thp -- src/main.py model=bert-base-cased batch_size=1 sequence_length=32 ...
```

Combining multiple tunings, here tcmalloc, Intel OpenMP, and two model instances:

```shell
python launcher.py --enable_tcmalloc --enable_iomp --ninstances=2 -- src/main.py --info config model=bert-base-cased batch_size=16 sequence_length=512
```