/Deep-Learning-Accelerator-SW

NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

Primary LanguagePythonOtherNOASSERTION

Deep Learning Accelerator

NVIDIA DLA hardware is a fixed-function accelerator engine targeted for deep learning operations. It’s designed to do full hardware acceleration of convolutional neural networks, supporting various layers such as convolution, deconvolution, fully connected, activation, pooling, batch normalization, and others. NVIDIA’s Orin SoCs feature up to two second-generation DLAs while Xavier SoCs feature up to two first-generation DLAs.

DLA software consists of the DLA compiler and the DLA runtime stack. The offline compiler translates the neural network graph into a DLA loadable binary and can be invoked using NVIDIA TensorRT™, NvMedia-DLA or cuDLA. The runtime stack consists of the DLA firmware, kernel mode driver, and user mode driver.

Reference for more details: DLA product page

Why is DLA essential on Orin?

  • DLA peak performance contributes between 38% and 74% to Orin's total Deep Learning (DL) performance (depending on the power mode, see table below)
  • DLA is on average 3x to 5x more power-efficient than the GPU (depending on the power mode and the workload, see DLA Performance per Watt (Power Efficiency) for details)

Here the distribution of DL TOPs between GPU and DLA on a Jetson AGX Orin 64GB, depending on the power mode:

Power mode: MAXN Power mode: 50W Power mode: 30W Power mode: 15W
GPU sparse INT8 peak DL performance 171 TOPs 109 TOPs 41 TOPs 14 TOPs
2x DLA sparse INT8 peak performance 105 TOPs 92 TOPs 90 TOPs 40 TOPs
Total Orin peak INT8 DL performance 275 TOPs 200 TOPs 131 TOPs 54 TOPs
Percentage: DLA peak INT8 performance of total Orin peak DL INT8 performance 38% 46% 69% 74%

Note:

  • The DLA TOPs of the 30W & 50W power modes on Jetson AGX Orin 64GB are comparable to the maximum clocks on DRIVE Orin platforms for Automotive
  • The maximum DLA TOPs on Jetson Orin NX 16GB are comparable to the 15W power mode on Jetson AGX Orin 64GB

DLA Reference Models

In this repo, we will cover a few key DLA-related metrics for standard deep learning model architectures in the context of common reference application implementations.

The goal is to provide a reference baseline about network architectures and how they map to DLA as well as the INT8 accuracy of these networks.

Accuracy on DLA and GPU

Use case Network INT8 Accuracy on Orin’s DLA Layers always running on GPU Instructions
Object Detection RetinaNet ResNeXt-50 mAP OpenImages MLPerf validation set*: 0.3741
(GPU INT8: 0.3740, FP32 reference 0.3757)
NMS (Last node of the network) See RetinaNet ResNeXt-50 section in scripts/prepare_models/README.md
Classification ResNet-50 Top-1 ImageNet 2012*: 76.34%
(GPU INT8: 76.42%, FP32 reference: 76.46%)
Top-K (Last node of the network) See ResNet-50 section in scripts/prepare_models/README.md
Object Detection SSD-ResNet-34 mAP COCO 2017*: 0.21
(GPU INT8: 0.21, FP32 reference 0.20)
NMS (Last node of the network) See SSD-ResNet-34 section in scripts/prepare_models/README.md
Object Detection SSD-MobileNetV1 mAP COCO 2017*: 0.23
(GPU INT8: 0.23, FP32 reference: 0.23)
NMS (Last node of the network) See SSD-MobileNetV1 section in scripts/prepare_models/README.md

*Accuracy measured internally by NVIDIA, there may be slight differences compared to previous MLPerf Inference submissions.

Key takeaways:

  • Networks tend to have common network backbones and some variation at the end. You can run the compute-intensive backbones on DLA and final layers for post-processing on the GPU.
  • GPU and DLA do not produce bitwise identical results. So the difference in the math is expected and the difference would be within a certain acceptable % of the FP32 reference.

More resources:

Structured Sparsity Case Study: Object Detection Accuracy with RetinaNet ResNet-34

DLA on the NVIDIA Orin platform supports Structured Sparsity that offers the opportunity to minimize latency and maximize throughput in production. See the TensorRT documentation for details (note that the listed restrictions may not apply anymore to most recent DLA releases).

Below case study presents that training models for Structured Sparsity is expected to maintain accuracy:

Network Weight mask pattern Waymo Open Dataset Test Accuracy – 2D Detection 2020
IoU=0.50:0.95 (all sizes) @ FP16 on Host
Steps involved
RetinaNet ResNet-34 No mask / Dense 44.7 Trained from scratch for 26 epochs.
RetinaNet ResNet-34 2:4 Sparse 44.8 Sparsified dense model as detailed in sparsity whitepaper, then trained for another 26 epochs.

Orin DLA Performance

DLA Dense performance

2x DLA images per second on a Jetson AGX Orin 64GB in dense operation measured with JetPack 5.1.1, depending on the power mode:

Network Power mode: MAXN Power mode: 50W Power mode: 30W Power mode: 15W
RetinaNet ResNeXt-50 (800x800, bs=1) 78 72 71 36
ResNet-34 backbone (1280x768, bs=1) 285 260 255 121
RetinaNet ResNet-34 (1280x768, bs=1) 108 98 96 45
SSD-ResNet-34 (1200x1200, bs=1) 83 76 74 36
ResNet-50 (224x224, bs=2) 2037 1948 1928 1072
SSD-MobileNetV1 (300x300, bs=2) 2664 2506 2472 1313

Key takeaways:

  • DLA's peak performance for these models in MAXN power mode is close to the achieved performance with 50W and 30W power modes - even at a reduced power budget, your DLA-based inference pipeline can sustain high throughput.
  • Attaching a RetinaNet head to a ResNet-34 backbone adds a considerable number of theoretical MAC operations per inference, and is hence expected to result in lower end-to-end throughput. Choose your task-specific head that you attach to backbones carefully.
  • Even at the lowest power mode, all of the benchmarks with high input resolution and batch size 1 can still sustain 30 images per second on 2 DLAs combined.

DLA Sparse performance

Generally, Structured Sparsity shows perf improvements over dense operation for dense Convolution layers that are already math-bound. The more math-bound a layer in dense operation, the higher the expected dense->sparse speedup after applying a 2:4 sparsity pattern.

2x DLA images per second on a Jetson AGX Orin 64GB in sparse operation measured with JetPack 5.1.1, depending on the power mode (with dense->sparse speedup):

Network Power mode: MAXN Power mode: 50W Power mode: 30W Power mode: 15W
RetinaNet ResNeXt-50 (800x800, bs=1) 102 (1.31x) 96 (1.34x) 95 (1.34x) 51 (1.43x)
ResNet-34 backbone (1280x768, bs=1) 384 (1.35x) 360 (1.38x) 354 (1.39x) 176 (1.45x)
RetinaNet ResNet-34 (1280x768, bs=1) 143 (1.32x) 133 (1.36x) 131 (1.37x) 66 (1.47x)
SSD-ResNet-34 (1200x1200, bs=1) 103 (1.24x) 97 (1.28x) 95 (1.28x) 49 (1.36x)

Just add --sparsity=force to the trtexec commands from scripts/prepare_models/README.md to reproduce.

Key takeaways:

  • Decreasing the power budget on Jetson Orin generally implies reducing the ratio of DLA TOPs to DRAM bandwidth (the hardware's op per Byte ratio). Lowering the hardware's op per Byte ratio means that more layers get the chance to become math-bound, and hence we can also observe higher dense->sparse speedups with lower power modes.

DLA Performance per Watt (Power Efficiency)

DLA & GPU form your perfect team for Deep Learning inference on the SoC. While the GPU delivers the most TOPs in high-power profiles, DLA excels at power efficiency.

Below table shows the Perf/W ratio of DLA relative to the GPU (accelerator power only, perf metric: images per second) on a Jetson AGX Orin 64GB measured with JetPack 5.1.1, depending on the power mode:

Network Weight mask mode Power mode: MAXN Power mode: 50W Power mode: 30W Power mode: 15W
RetinaNet ResNeXt-50 (800x800, bs=1) Dense 3.8x 3.1x 2.8x 4.2x
RetinaNet ResNeXt-50 (800x800, bs=1) Sparse 4.7x 4.1x 3.7x 5.6x
ResNet-34 backbone (1280x768, bs=1) Dense 4.3x 3.7x 3.4x 4.9x
ResNet-34 backbone (1280x768, bs=1) Sparse 3.4x 3.0x 2.7x 4.1x
RetinaNet ResNet-34 (1280x768, bs=1) Dense 4.2x 3.5x 3.1x 4.6x
RetinaNet ResNet-34 (1280x768, bs=1) Sparse 3.5x 2.9x 2.6x 3.9x
SSD-ResNet-34 (1200x1200, bs=1) Dense 3.7x 3.0x 2.7x 4.1x
SSD-ResNet-34 (1200x1200, bs=1) Sparse 3.5x 3.3x 2.9x 4.3x
ResNet-50 (224x224, DLA bs=2, GPU bs=256) Dense 3.5x 2.9x 2.6x 3.9x
SSD-MobileNetV1 (300x300, DLA bs=2, GPU bs=128) Dense 4.2x 3.6x 3.2x 5.0x

Key takeaways:

  • DLA is about 3x to 5x more power efficient than GPU for these benchmarks
  • Enabling Structured Sparsity generally improves DLA's power efficiency
  • At the lowest power mode of 15W, DLA's power efficiency is the highest (where 74% total Orin peak DL INT8 performance comes from the DLAs)

ONNX operators supported on DLA

See operators/README.md for details on ONNX operators already supported on DLA and planned to be supported in future releases.

Setup

Install the Python dependencies with (only supported on x86 hosts):

python3 -m pip install requirements.txt