
NVIDIA DLA-SW: recipes and tools for running deep learning inference workloads on NVIDIA DLA cores.


Deep Learning Accelerator

NVIDIA DLA hardware is a fixed-function accelerator engine targeted at deep learning operations. It is designed for full hardware acceleration of convolutional neural networks, supporting layers such as convolution, deconvolution, fully connected, activation, pooling, and batch normalization. NVIDIA's Orin SoCs feature up to two second-generation DLAs, while Xavier SoCs feature up to two first-generation DLAs.
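To check what a given platform offers, the TensorRT Python API exposes the number of visible DLA cores. A minimal sketch, assuming TensorRT is installed (illustrative, not part of this repo):

    # Hedged example: query how many DLA cores TensorRT can see on this SoC.
    # Expect up to 2 on Orin and Xavier, and 0 on platforms without DLA.
    import tensorrt as trt

    builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
    print("DLA cores available:", builder.num_DLA_cores)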

DLA software consists of the DLA compiler and the DLA runtime stack. The offline compiler translates the neural network graph into a DLA loadable binary and can be invoked using NVIDIA TensorRT™, NvMedia-DLA or cuDLA. The runtime stack consists of the DLA firmware, kernel mode driver, and user mode driver.
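For example, when going through TensorRT, the DLA loadable is produced during a normal engine build once the builder config targets DLA. A minimal sketch using the TensorRT Python API (the ONNX path is a placeholder, and INT8 additionally requires calibration or explicit dynamic ranges):

    # Hedged sketch: build a TensorRT engine whose layers run on DLA.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

    # Parse a network from ONNX; "model.onnx" is a placeholder path.
    parser = trt.OnnxParser(network, logger)
    with open("model.onnx", "rb") as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.default_device_type = trt.DeviceType.DLA  # place layers on DLA
    config.DLA_core = 0                              # select DLA core 0
    config.set_flag(trt.BuilderFlag.INT8)            # DLA runs INT8/FP16
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers -> GPU

    engine_bytes = builder.build_serialized_network(network, config)

The same build can be driven from the command line with trtexec, e.g. trtexec --onnx=model.onnx --useDLACore=0 --int8 --allowGPUFallback.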

DLA Reference Models

In this repo, we cover key DLA-related metrics for standard deep learning model architectures in the context of common reference applications.

The goal is to provide a reference baseline for how these network architectures map to DLA, as well as their INT8 accuracy on DLA.

| Use case         | Network         | INT8 Accuracy on Orin's DLA                                              | Layers always running on GPU     | Instructions                                                        |
|------------------|-----------------|--------------------------------------------------------------------------|----------------------------------|---------------------------------------------------------------------|
| Classification   | ResNet-50       | Top-1 ImageNet 2012*: 76.34% (GPU INT8: 76.42%, FP32 reference: 76.46%)  | Top-K (last node of the network) | See the ResNet-50 section in scripts/prepare_models/README.md       |
| Object Detection | SSD-ResNet-34   | mAP COCO 2017*: 0.21 (GPU INT8: 0.21, FP32 reference: 0.20)              | NMS (last node of the network)   | See the SSD-ResNet-34 section in scripts/prepare_models/README.md   |
| Object Detection | SSD-MobileNetV1 | mAP COCO 2017*: 0.23 (GPU INT8: 0.23, FP32 reference: 0.23)              | NMS (last node of the network)   | See the SSD-MobileNetV1 section in scripts/prepare_models/README.md |

*Accuracy measured internally by NVIDIA; there may be slight differences compared to previous MLPerf submissions.

Key takeaways:

  • Networks tend to share a common compute-intensive backbone, with some variation in the final layers. You can run the backbone on DLA and the final post-processing layers on the GPU (see the sketch after this list).
  • GPU and DLA are not bitwise accurate, so some difference in the math is expected; the results should stay within an acceptable percentage of the FP32 reference.
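A hedged sketch of that split with the TensorRT Python API, reusing the network and config objects from the build example above: DLA stays the default device, and any layer that cannot run on DLA (such as a trailing Top-K or NMS node) is pinned to the GPU. With GPU_FALLBACK set this placement happens automatically; the loop just makes it explicit.

    # Hedged example: keep the backbone on DLA, pin unsupported layers to GPU.
    config.default_device_type = trt.DeviceType.DLA
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        # can_run_on_DLA reports whether TensorRT can place this layer on DLA
        if not config.can_run_on_DLA(layer):
            config.set_device_type(layer, trt.DeviceType.GPU)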


Setup

Install the Python dependencies with:

python3 -m pip install -r requirements.txt