WasabiFan/tidl-yolov5-custom-model-demo

TI providers vs. native ONNX inference

Closed this issue · 4 comments

This repo is amazing, great stuff. Thanks a lot for the effort you put into this!

Do you by any chance know what the benefit is of using the TI execution providers for compilation and inference compared to just running plain ONNX Runtime? E.g.

# torch.onnx.export(...)
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the exported model's weights to uint8 on the host
quantize_dynamic(onnx_path, int8_onnx_path, weight_type=QuantType.QUInt8)

and on device

import onnxruntime

# CPU-only inference with the quantized model
ep_list = ['CPUExecutionProvider']
so = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(int8_onnx_path, providers=ep_list, sess_options=so)

I'm trying to justify to myself all the hassle with the .prototxt file; it looks very tailored to YOLOv5. TI doesn't seem to publish any benchmarks, or at least I couldn't find any.

If you don't import the TI libraries, the only execution provider you have access to is CPUExecutionProvider. It looks like you've figured this out. Meanwhile, TIDLExecutionProvider is the additional execution provider you get from TI.

Execution providers determine what hardware is used to evaluate the forward pass:

  • TIDLExecutionProvider uses TI's "C7x" signal processing cores, which are specialized hardware included specifically for accelerating machine learning workloads. There's a dedicated matrix multiply accelerator that most layers can be dispatched to. Because of the TIDL pre-compilation step, it also uses TI's quantization scheme, which is supposedly fairly efficient and high-fidelity, versus whatever ONNX Runtime defaults to.
  • CPUExecutionProvider just evaluates the layers on the Arm Cortex-A72 CPU cores running your Linux OS. It might use some NEON (SIMD) instructions for efficiency, but it's still just crunching your multiplications on the CPU.

So if you use the CPU provider, you're not taking advantage of the neural network accelerator hardware. You'll get performance akin to running on a Raspberry Pi or low-end smartphone. The TDA4VM/BBAI-64 is not really designed to be used this way; you're not using the "AI" features it provides and might as well buy a cheaper part.

The TIDL provider, on the other hand, uses the neural network acceleration features that the chip is marketed around and that are its primary selling points.
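
For concreteness, here's roughly what selecting the TIDL provider looks like on-device. This is only a sketch: it assumes TI's fork of onnxruntime is installed and that the TIDL compilation artifacts were already generated on the host, and the provider-option keys shown (platform, artifacts_folder) are illustrative placeholders that vary between SDK versions.

import onnxruntime

# Prefer the TIDL accelerator; fall back to the CPU for unsupported layers
ep_list = ['TIDLExecutionProvider', 'CPUExecutionProvider']

# Illustrative options only; the real keys/values depend on your TIDL SDK version
tidl_options = {
    'platform': 'J7',
    'artifacts_folder': '/path/to/compiled/artifacts',
}

so = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(
    onnx_path,
    providers=ep_list,
    provider_options=[tidl_options, {}],  # one options dict per provider
    sess_options=so,
)

Layers TIDL supports get dispatched to the C7x/MMA; anything it can't handle falls back to the CPU provider within the same session.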

If the CPU provider is sufficiently fast for your needs, I'd recommend using a Raspberry Pi or similar instead. A Raspberry Pi 4 is $150 cheaper, uses the same CPU IP, has more cores, and is clocked only slightly lower. So you could get similar or better performance. Conversely, if you'd like to take advantage of the TDA4VM/BBAI-64, you'll need to use TI's quantization and execution provider to get the performance they promise.

And I agree, the TI stuff is a pain. My hope is that this repo alleviates some of that burden (and I'm glad to hear it's helpful!). But nonetheless, it imposes limitations that you wouldn't have if using a Raspberry Pi (cheaper, less powerful) or NVIDIA Jetson (more expensive, more powerful). So it's good to ask whether the complexity is warranted in your use-case.

Wow, thank you once again for such a fast and informative reply! Got it, everything's clear, closing the issue

Hi again. Just wanted to share some info to answer my own question. The speedup I get from using the TI providers for various models is huge: the models run 100-200x faster than plain CPU-only ONNX Runtime inference. Pretty impressive.
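
For anyone who wants to reproduce a comparison like this, a simple per-inference timing loop is enough. A sketch only: the model path, batch size, and 640x640 input shape are placeholders for a YOLOv5-style model, and the TIDL session would be built as shown earlier in the thread and timed the same way.

import time
import numpy as np
import onnxruntime

def average_latency(session, iters=50):
    # Feed random data matching the model input and average wall-clock time per run
    inp = session.get_inputs()[0]
    dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
    session.run(None, {inp.name: dummy})  # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(iters):
        session.run(None, {inp.name: dummy})
    return (time.perf_counter() - start) / iters

cpu_session = onnxruntime.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])
print('CPU latency (s):', average_latency(cpu_session))
# Build the TIDL-accelerated session (with provider_options) and time it the same way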