tf2onnx produces a graph that performs poorly
Rayndell opened this issue · 1 comment
Describe the bug
I converted a frozen TensorFlow model (in the form of a .pb file) to ONNX using the following command:
python -m tf2onnx.convert --input rfcn_WIDERFACE.pb --inputs image_tensor:0[1,-1,-1,3] --outputs num_detections:0,detection_scores:0,detection_classes:0,detection_boxes:0 --output rfcn_WIDERFACE.onnx --opset=15
It gave me an ONNX file that is much slower on GPU (almost 2 seconds per image on average) than on CPU (0.35 seconds per image) with ONNX Runtime, both in Python and C++. After analyzing the models, it turned out that there are many Loop subgraphs in the model, which is most likely the cause of the poor performance.
Could this be a bug in tf2onnx? I installed the latest tf2onnx via pip install.
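For reference, the CPU-vs-GPU comparison and the Loop count can be reproduced with a minimal script along these lines (the input name follows the conversion command above; the uint8 dtype and the 640x640 dummy image are assumptions, not my actual test data):

```python
import time
import numpy as np
import onnx
import onnxruntime as ort

MODEL = "rfcn_WIDERFACE.onnx"

# Count Loop nodes in the top-level graph of the converted model.
model = onnx.load(MODEL)
print("Loop nodes:", sum(1 for n in model.graph.node if n.op_type == "Loop"))

# Time one inference per execution provider with a dummy image.
image = np.random.randint(0, 255, size=(1, 640, 640, 3), dtype=np.uint8)
for providers in (["CPUExecutionProvider"],
                  ["CUDAExecutionProvider", "CPUExecutionProvider"]):
    sess = ort.InferenceSession(MODEL, providers=providers)
    sess.run(None, {"image_tensor:0": image})   # warm-up run
    start = time.perf_counter()
    sess.run(None, {"image_tensor:0": image})
    print(providers[0], f"{time.perf_counter() - start:.3f} s/image")
```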
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 18.04): Windows Server 2016
- TensorFlow Version: 2.9
- Python version: 3.8
- ONNX version (if applicable, e.g. 1.11): 1.16
- ONNXRuntime version (if applicable, e.g. 1.11): 1.16
To Reproduce
python -m tf2onnx.convert --input rfcn_WIDERFACE.pb --inputs image_tensor:0[1,-1,-1,3] --outputs num_detections:0,detection_scores:0,detection_classes:0,detection_boxes:0 --output rfcn_WIDERFACE.onnx --opset=15
The original .pb model: https://evolucare-my.sharepoint.com/:u:/p/a_ducournau/EdHJfemstxxOjNA0uTGfmEUBzHIfEhiSHHQ20jZ-v_zY0w?e=Ew1QSa
The produced ONNX model: https://evolucare-my.sharepoint.com/:u:/p/a_ducournau/ETufzCteZplCjU-ODydjq9QBI9vdXQ-MIE8FthiJdxR2rA?e=Uekq9w
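The same conversion can also be run from Python via the tf2onnx API. This is a rough sketch, not the exact invocation I used; it assumes tf2onnx.convert.from_graph_def accepts the same input/output names and a shape_override mirroring [1,-1,-1,3]:

```python
import tensorflow as tf
import tf2onnx

# Load the frozen graph (same .pb as in the CLI command above).
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("rfcn_WIDERFACE.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# Rough equivalent of the CLI call.
model_proto, _ = tf2onnx.convert.from_graph_def(
    graph_def,
    input_names=["image_tensor:0"],
    output_names=["num_detections:0", "detection_scores:0",
                  "detection_classes:0", "detection_boxes:0"],
    shape_override={"image_tensor:0": [1, -1, -1, 3]},
    opset=15,
    output_path="rfcn_WIDERFACE.onnx",
)
print("nodes in converted graph:", len(model_proto.graph.node))
```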
Could you please share more of your analysis on why those Loop ops make performance worse on GPU than on CPU?
If that is the case, we probably need to open an issue in the onnxruntime repo to investigate why this op performs worse on GPU.
From the tf2onnx perspective, if the ONNX model's performance on CPU were worse than the TF model's performance on CPU, we might consider redesigning the conversion to improve it. But for your case, we probably need to check the difference between the Loop op implementations on GPU and CPU.
Thoughts?
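For example, onnxruntime's built-in profiler can show how much time each node takes on the CUDA provider, including the Loop bodies and any MemcpyToHost/MemcpyFromHost nodes inserted around them. A minimal sketch, reusing the input name and dummy uint8 image from the script above:

```python
import numpy as np
import onnxruntime as ort

# Enable per-node profiling to see where time goes on GPU.
opts = ort.SessionOptions()
opts.enable_profiling = True

sess = ort.InferenceSession(
    "rfcn_WIDERFACE.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
image = np.random.randint(0, 255, size=(1, 640, 640, 3), dtype=np.uint8)
sess.run(None, {"image_tensor:0": image})

# Writes a JSON trace (viewable in chrome://tracing); each op entry records
# its execution provider, so CPU fallbacks and memcpy overhead are visible.
print("profile written to:", sess.end_profiling())
```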