VeriSilicon/tflite-vx-delegate

WARNING: Fallback unsupported op 32 to TfLite

Closed this issue · 4 comments

Which Khadas SBC do you use?
VIM3 A311D
Which system do you use?
Kernel 4.9 Ubuntu 20.04

Issue below:

I have a tflite model for object detection which I want to run on the NPU for acceleration. I used the NPU TFLite vx-delegate as mentioned in the examples. Inference with the models from the examples is fast, but when I try our own model, inference is slow. It also prints the WARNING: Fallback unsupported op 32 to TfLite.
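For reference, this is roughly how I load the model with the vx delegate (a minimal sketch; the library path /usr/lib/libvx_delegate.so and the file name model.tflite are placeholders for my actual setup):

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the VX delegate shared library (path is an assumption,
# adjust to wherever libvx_delegate.so is installed on the VIM3).
vx_delegate = tflite.load_delegate("/usr/lib/libvx_delegate.so")

interpreter = tflite.Interpreter(
    model_path="model.tflite",              # placeholder model file
    experimental_delegates=[vx_delegate],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy input just to exercise the pipeline.
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))
```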

Output while running the code:
Log with NPU:
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
WARNING: Fallback unsupported op 32 to TfLite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

What exactly does this warning mean, and how can I resolve it?
Code and Model

Hi, don't worry about that. I checked the structure of your model, and it contains an op named "TFlite_Detection_PostProcess", which is not a standard builtin op defined by TFLite (it is actually a custom op, op 32). That custom op is not directly supported by our vx-delegate.
The fallback mechanism handles ops that we do not support: it divides the complete graph into the parts that our NPU can run and the remaining subgraphs that the CPU executes.
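If you want to confirm which op triggers the fallback, you can dump the model's operator list; a small sketch using the TensorFlow Lite model analyzer (needs a recent TensorFlow, and the model file name is a placeholder):

```python
import tensorflow as tf

# Prints every operator in the model; the TFLite_Detection_PostProcess
# node will show up as a CUSTOM op, and only that part falls back to CPU.
tf.lite.experimental.Analyzer.analyze(model_path="detect.tflite")
```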

@chenfeiyue-cfy, but I found that inference of this model on the NPU is slower than on the CPU. Can you please guide me on this? Logs with NPU and with CPU are below for your reference. Thank you.

Log with NPU:
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
WARNING: Fallback unsupported op 32 to TfLite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
[[0.21484375 0.16015625 0.09765625 0.0859375 0.07421875 0.05859375
0.05078125 0.046875 0.046875 0.046875 0.046875 0.046875
0.0390625 0.03515625 0.03515625 0.03515625 0.03515625 0.03125
0.03125 0.03125 0.03125 0.02734375 0.02734375 0.02734375
0.02734375]]
4.172280788421631 sec

Log with CPU:
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[[0.20703125 0.16796875 0.09765625 0.078125 0.05859375 0.0546875
0.05078125 0.046875 0.04296875 0.04296875 0.04296875 0.04296875
0.0390625 0.03515625 0.03515625 0.03515625 0.03515625 0.03515625
0.03515625 0.03515625 0.03125 0.03125 0.03125 0.03125
0.02734375]]
0.19845318794250488 sec

Hi, sorry for the late reply. In your Python code, timing starts before load_delegate, so it inevitably includes the initial compilation and optimization time rather than just the runtime.
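Something along these lines gives a fairer number (a sketch only; the delegate path and model name are placeholders): load the delegate, run one warm-up invoke so the NPU graph compilation happens outside the measurement, and then time only the repeated invokes.

```python
import time
import numpy as np
import tflite_runtime.interpreter as tflite

vx_delegate = tflite.load_delegate("/usr/lib/libvx_delegate.so")  # assumed path
interpreter = tflite.Interpreter(model_path="model.tflite",
                                 experimental_delegates=[vx_delegate])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
data = np.zeros(inp["shape"], dtype=inp["dtype"])

# Warm-up: the first invoke triggers graph compilation/optimization on the NPU.
interpreter.set_tensor(inp["index"], data)
interpreter.invoke()

# Time only steady-state inference.
runs = 10
start = time.time()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], data)
    interpreter.invoke()
print((time.time() - start) / runs, "sec per inference")
```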

Thank you @chenfeiyue-cfy for your support.