PyTorch model optimizations for inference

Find the bottleneck

Running an efficient DL model boils down to three components:

  • Compute: time spent on the GPU running actual FLOPs
  • Memory: time spent transferring data, e.g., from CPU to GPU or within the GPU memory hierarchy
  • Overhead: everything else. Great blog post from Horace.
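
To see which of these dominates, a quick profile helps; a minimal sketch using torch.profiler (the resnet18 model and input shape are just placeholders):

```python
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

# Any model works here; resnet18 is just a placeholder.
model = models.resnet18().eval().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
) as prof:
    model(x)

# Sort by CUDA time to see whether compute, memory copies,
# or CPU-side overhead dominates.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```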

High-level ideas

  • Cloud vs. edge -- many prefer edge to save cloud costs and avoid network latency.
  • Target hardware -- figure out the ideal hardware based on model size, optimizations available for that hardware, and cost.
  • Choose a path to optimize the model -- it's a trade-off between effort and the performance gain you need.
  • Use PyTorch out-of-the-box techniques: quantization (works for CPU), TorchScript (gives you a graph and removes the dependency on Python), pruning (not very popular). See the sketch after this list.
  • Use model compilers -- TensorRT, TVM.
  • Or use offerings such as IPEX and ONNX Runtime.
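
A minimal sketch of the out-of-the-box path, combining dynamic quantization (CPU-oriented) with TorchScript; the tiny Sequential model here is a placeholder for your own network:

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this is your trained network.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Dynamic quantization: weights stored as int8, activations quantized on the fly.
# Targets CPU inference and mainly helps Linear/LSTM-heavy models.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# TorchScript: capture a graph and drop the Python dependency for deployment.
scripted = torch.jit.script(quantized)
scripted.save("model_int8.pt")

out = scripted(torch.randn(1, 256))
print(out.shape)
```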

Target hardware

To run a framework on a given piece of hardware, the hardware vendor needs to support that framework. Hardware companies provide kernels for a number of frameworks; for example, Nvidia has CUDA and cuDNN.

A fundamental challenge is that different hardware types have different memory layouts and compute primitives.

Supported hardware for PyTorch: CPU, GPU, TPU, Inferentia, Trainium
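
A quick way to check what the local PyTorch build supports (TPU, Inferentia, and Trainium need their own plugins such as torch_xla or torch_neuronx, not shown here); a minimal sketch:

```python
import torch

# Which accelerator backends does this build of PyTorch see?
print("CUDA available:", torch.cuda.is_available())
print("cuDNN available:", torch.backends.cudnn.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
# Apple-silicon backend (PyTorch 1.12+).
print("MPS available:", torch.backends.mps.is_available())
```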

Compute primitives

  • CPU: scalar, vector
  • GPU: one-dimensional vector, two-dimensional (Tensor Cores on A100, V100, T4)
  • TPU: two-dimensional vectors
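
To actually use the two-dimensional units (Tensor Cores) on Nvidia GPUs, matmuls need reduced precision or TF32 enabled; a minimal sketch, assuming an Ampere-class GPU for the TF32 line:

```python
import torch

# Allow TF32 matmuls on Ampere+ GPUs (e.g., A100) so fp32 matmuls hit Tensor Cores.
torch.set_float32_matmul_precision("high")

model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(64, 1024, device="cuda")

# Autocast runs eligible ops in fp16, which maps onto Tensor Cores on V100/T4/A100.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
print(y.dtype)  # torch.float16
```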

Memory layouts, shown below, play an important role in performance.

<img width="150" height="150" alt="Screen Shot 2022-04-08 at 5 57 49 PM" src="https://user-images.githubusercontent.com/9162336/162550272-1f509587-476e-4fbd-9409-2b6faa8eb443.png">
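
One concrete example in PyTorch is the channels-last memory format, which many cuDNN convolution kernels prefer over the default NCHW layout; a minimal sketch with a placeholder resnet50:

```python
import torch
import torchvision.models as models

model = models.resnet50().eval().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

# Same logical NCHW tensor, but stored as NHWC in memory.
model = model.to(memory_format=torch.channels_last)
x = x.contiguous(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)
print(y.shape)
```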

TVM Paper

IRs (Intermediate Representations)

Frameworks do not target every hardware backend directly; instead they provide an IR as a bridge between the framework and the hardware. Hardware companies then take the IR and compile (lower) it to machine code for their chip. The compiler takes the IR and generates high-level and low-level code using a codegen backend (mostly LLVM).
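
In PyTorch, TorchScript is one such IR: tracing a model captures a graph that backends can lower further, and torch.onnx.export produces ONNX, another common exchange IR. A minimal sketch with a placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()
x = torch.randn(1, 3, 32, 32)

# Trace the model to capture its TorchScript IR graph.
traced = torch.jit.trace(model, x)
print(traced.graph)  # the IR that downstream compilers can lower

# The same model can be exported to ONNX, an IR consumed by
# TensorRT, ONNX Runtime, and other backends.
torch.onnx.export(model, x, "model.onnx")
```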

Transformer Optimizations

Flash Attention

The idea here is to avoid multiple accesses to the GPU's global memory. How? FlashAttention fuses attention into a single kernel that tiles Q, K, and V into on-chip SRAM and computes the softmax incrementally, so the full N×N attention matrix is never materialized in global memory.
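
PyTorch 2.x exposes a fused, FlashAttention-style kernel through torch.nn.functional.scaled_dot_product_attention, which dispatches to the flash backend when device, dtype, and shapes allow; a minimal sketch with placeholder tensor sizes:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); fp16 on GPU is eligible for the flash kernel.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Fused attention: the N x N score matrix is never written to global memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```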