/optimum-benchmark

A repository for benchmarking HF Optimum's optimizations for inference and training.

Primary LanguagePythonApache License 2.0Apache-2.0

Optimum-Benchmark

Optimum-Benchmark is a unified multi-backend utility for benchmarking transformers, diffusers, peft and timm models with Optimum's optimizations & quantization, for inference & training, on different backends & hardwares (OnnxRuntime, Intel Neural Compressor, OpenVINO, Habana Gaudi Processor (HPU), etc).

The experiment management and tracking is handled using hydra which allows for simple cli with minimum configuration changes and maximum flexibility (inspired by tune).

Motivation

  • Many hardware vendors would want to know how their hardware performs compared to others on the same models.
  • Many HF users would want to know how their chosen model performs in terms of latency, throughput, memory usage, energy consumption, etc.
  • Optimum offers a lot of hardware and backend specific optimizations & quantization schemas that can be applied to models and improve their performance.
  • Benchmarks depend heavily on many factors, like input/hardware/releases/etc, but most don't report these factors (e.g. comparing an A100 to an RTX 3090 on a singleton batch).
  • [...]

Features

optimum-benchmark allows you to run benchmarks with no code and minimal user input, just specify:

  • The device to use (e.g. cuda).
  • The type of benchmark (e.g. training)
  • The backend to run on (e.g. onnxruntime).
  • The model name or path (e.g. bert-base-uncased)
  • And optionally, the model's task (e.g. text-classification).

Everything else is either optional or inferred from the model's name or path.

Supported Backends/Devices

  • Pytorch backend for CPU
  • Pytorch backend for CUDA
  • Pytorch backend for Habana Gaudi Processor (HPU)
  • OnnxRuntime backend for CPUExecutionProvider
  • OnnxRuntime backend for CUDAExecutionProvider
  • OnnxRuntime backend for TensorrtExecutionProvider
  • Intel Neural Compressor backend for CPU
  • OpenVINO backend for CPU

Benchmark features

  • Latency and throughput tracking (default).
  • Peak memory tracking (benchmark.memory=true).
  • Energy and carbon emissions (benchmark.energy=true).
  • Warm up runs before inference (benchmark.warmup_runs=20).
  • Warm up steps during training (benchmark.warmup_steps=20).
  • Inputs shapes control (e.g. benchmark.input_shapes.sequence_length=128).
  • Dataset shapes control (e.g. benchmark.dataset_shapes.dataset_size=1000).
  • Forward and Generation pass control (e.g. for an LLM benchmark.generate.max_new_tokens=100, for a diffusion model benchmark.forward.num_images_per_prompt=4).

Backend features

  • Random weights initialization (backend.no_weights=true for fast model instantiation without downloading weights).
  • Onnxruntime Quantization and AutoQuantization (backend.quantization=true or backend.auto_quantization=avx2, etc).
  • Onnxruntime Calibration for Static Quantization (backend.quantization_config.is_static=true, etc).
  • Onnxruntime Optimization and AutoOptimization (backend.optimization=true or backend.auto_optimization=O4, etc).
  • PEFT training (backend.peft_strategy=lora, backend.peft_config.task_type=CAUSAL_LM, etc).
  • DDP training (backend.use_ddp=true, backend.ddp_config.nproc_per_node=2, etc).
  • BitsAndBytes quantization scheme (backend.quantization_scheme=bnb, backend.quantization_config.load_in_4bit, etc).
  • GPTQ quantization scheme (backend.quantization_scheme=gptq, backend.quantization_config.bits=4, etc).
  • Optimum's BetterTransformer (backend.bettertransformer=true).
  • Automatic Mixed Precision (backend.amp_autocast=true).
  • Dynamo/Inductor compiling (backend.torch_compile=true).

Quickstart

Installation

You can install optimum-benchmark using pip:

python -m pip install git+https://github.com/huggingface/optimum-benchmark.git

or by cloning the repository and installing it in editable mode:

git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark

python -m pip install -e .

Depending on the backends you want to use, you might need to install some extra dependencies:

  • OpenVINO: pip install optimum-benchmark[openvino]
  • OnnxRuntime: pip install optimum-benchmark[onnxruntime]
  • OnnxRuntime-GPU: pip install optimum-benchmark[onnxruntime-gpu]
  • Intel Neural Compressor: pip install optimum-benchmark[neural-compressor]
  • Text Generation Inference: pip install optimum-benchmark[text-generation-inference]

You can now run a benchmark using the command line by specifying the configuration directory and the configuration name. Both arguments are mandatory for hydra. config-dir is the directory where the configuration files are stored and config-name is the name of the configuration file without its .yaml extension.

optimum-benchmark --config-dir examples/ --config-name pytorch_bert

This will run the benchmark using the configuration in examples/pytorch_bert.yaml and store the results in runs/pytorch_bert.

The result files are inference_results.csv, the program's logs experiment.log and the configuration that's been used hydra_config.yaml. Some other files might be generated depending on the configuration (e.g. forward_codecarbon.csv if benchmark.energy=true).

The directory for storing these results can be changed by setting hydra.run.dir (and/or hydra.sweep.dir in case of a multirun) in the command line or in the config file.

Command-line configuration overrides

It's easy to override the default behavior of a benchmark from the command line.

optimum-benchmark --config-dir examples/ --config-name pytorch_bert model=gpt2 device=cuda:1

Multirun configuration sweeps

You can easily run configuration sweeps using the -m or --multirun option. By default, configurations will be executed serially but other kinds of executions are supported with hydra's launcher plugins : hydra/launcher=submitit, hydra/launcher=rays, hydra/launcher=joblib, etc.

optimum-benchmark --config-dir examples --config-name pytorch_bert -m device=cpu,cuda

Also, for integer parameters like batch_size, one can specify a range of values to sweep over:

optimum-benchmark --config-dir examples --config-name pytorch_bert -m device=cpu,cuda benchmark.input_shapes.batch_size='range(1,10,step=2)'

Reporting benchmark results (WIP)

To aggregate the results of a benchmark (run(s) or sweep(s)), you can use the optimum-report command.

optimum-report --experiments {experiments_folder_1} {experiments_folder_2} --baseline {baseline_folder} --report-name {report_name}

This will create a report in the reports folder with the name {report_name}. The report will contain the results of the experiments in {experiments_folder_1} and {experiments_folder_2} compared to the results of the baseline in {baseline_folder} in the form of a .csv file, an .svg rich table and (a) .png plot(s).

You can also reuse some components of the reporting script for your use case (examples in [examples/training-llamas] and [examples/running-llamas]).

Configurations structure

You can create custom configuration files following the examples here. You can also use hydra's composition with a base configuration (examples/pytorch_bert.yaml for example) and override/define parameters.

To create a configuration that uses a wav2vec2 model and onnxruntime backend, it's as easy as:

defaults:
  - pytorch_bert
  - _self_
  - override backend: onnxruntime

experiment_name: onnxruntime_wav2vec2
model: bookbot/distil-wav2vec2-adult-child-cls-37m
device: cpu

Other than the examples, you can also check tests.

Contributing

Contributions are welcome! And we're happy to help you get started. Feel free to open an issue or a pull request. Things that we'd like to see:

  • More backends (Tensorflow, TFLite, Jax, etc).
  • More tests (right now we only have few tests per backend).
  • Task evaluators for the most common tasks (would be great for output regression).
  • More hardware support (Habana Gaudi Processor (HPU), RadeonOpenCompute (ROCm), etc).