tensorflow/tensorrt

Why no improvement in the object_detection example

chinesesoft8 opened this issue · 12 comments

tensorflow-gpu 1.14
CUDA 10.1
GPU: GTX 1080

my_test.json
{
  "model_config": {
    "model_name": "ssd_resnet_50_fpn_coco",
    "input_dir": "/home/liujt/software/tensorrt/data",
    "batch_size": 8,
    "override_nms_score_threshold": 0.3
  },
  "optimization_config": {
    "use_trt": true,
    "precision_mode": "INT8",
    "calib_images_dir": "/home/liujt/software/tensorrt/data/train2017",
    "num_calib_images": 8,
    "calib_batch_size": 8,
    "calib_image_shape": [640, 640],
    "max_workspace_size_bytes": 17179869184
  },
  "benchmark_config": {
    "images_dir": "/home/liujt/software/tensorrt/data/val2017",
    "annotation_path": "/home/liujt/software/tensorrt/data/annotations/instances_val2017.json",
    "batch_size": 8,
    "image_shape": [640, 640],
    "num_images": 4096,
    "output_path": "/home/liujt/software/tensorrt/data/stats/ssd_resnet_50_fpn_coco_trt_int8.json"
  },
  "assertions": [
    "statistics['map'] > (0.277 - 0.01)"
  ]
}

I ran the test twice.

First run:
  "optimization_config": {
    "use_trt": true,

Second run:
  "optimization_config": {
    "use_trt": false,

The command is: python -m tftrt.examples.object_detection.test my_test.json >2.log 2>&1
Attached logs: 1.log, 2.log

use_trt = true
{
  "avg_latency_ms": 533.185324338403,
  "avg_throughput_fps": 15.004163908537258,
  "map": 0.2776245020621034
}
ASSERTION PASSED: statistics['map'] > (0.277 - 0.01)

use_trt = false
{
  "avg_latency_ms": 523.7633502129281,
  "avg_throughput_fps": 15.274073676112925,
  "map": 0.2776245020621034
}
ASSERTION PASSED: statistics['map'] > (0.277 - 0.01)

You are using an NVIDIA GTX 1080 graphics card, which is the previous-generation Pascal architecture. Pascal GPUs do not have Tensor Cores, so they do not support FP16/INT8 automatic mixed precision. You are therefore effectively running TensorRT inference in FP32, which is why there is no acceleration.


Thanks very much. So TensorRT inference in FP32 gives no improvement compared with not using TensorRT at all? But TensorRT also optimizes the graph; shouldn't that alone provide some acceleration?

But in your my_test.json, precision_mode is set to INT8 rather than FP32; maybe you can change it and try again?
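
For example, the relevant part of my_test.json would become (other fields unchanged; the calibration settings are only needed for INT8):

  "optimization_config": {
    "use_trt": true,
    "precision_mode": "FP32",
    ...
  }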

NMS (non-maximum suppression) is quite expensive.

There is an NMS op called combined_nms that you can use for SSD.
TF-TRT can optimize that op quite well.

When you build the model using the Object Detection API, try setting use_combined_nms in the post-processing section of the pipeline config file; a sketch is below.
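
A minimal sketch of the relevant post_processing block in an Object Detection API pipeline.config; use_combined_nms is the point here, and the threshold values are illustrative, so keep whatever your own config already uses:

post_processing {
  batch_non_max_suppression {
    score_threshold: 0.3
    iou_threshold: 0.6
    max_detections_per_class: 100
    max_total_detections: 100
    use_combined_nms: true  # route NMS through the CombinedNonMaxSuppression op
  }
  score_converter: SIGMOID
}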

@pooyadavoodi if using CombinedNonMaxSuppression in TF 2.0 and TensorRT 6.1, does it support INT8 inference?

If not, can I force the converter to use the CPU for NMS? It looks like TrtGraphConverterV2 does not have a parameter for blacklisting ops.

What is the best practice if I cannot use some op in INT8 mode with TF2?

I think TRT will fall back to FP32 in case INT8 is not supported for a layer/plugin. In general, you shouldn't worry about whether some ops are unsupported in INT8, because that is taken care of inside TRT.

That being said, there is currently no way to tell the converter not to convert a particular op.
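
For reference, a minimal INT8 conversion sketch with TrtGraphConverterV2; the paths and the random calibration batches are hypothetical placeholders, and the exact parameter set may differ slightly across TF 2.x releases:

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Hypothetical paths -- substitute your own SavedModel directories.
SAVED_MODEL_DIR = "/path/to/saved_model"
OUTPUT_DIR = "/path/to/trt_saved_model"

params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.INT8,
    max_workspace_size_bytes=1 << 30,
    use_calibration=True,  # INT8 requires calibration data
)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    conversion_params=params,
)

def calibration_input_fn():
    # Yield a few representative batches; shapes must match the model input.
    for _ in range(8):
        yield (np.random.uniform(0, 255, (1, 640, 640, 3)).astype(np.float32),)

converter.convert(calibration_input_fn=calibration_input_fn)
converter.save(OUTPUT_DIR)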

@pooyadavoodi

Thank you for your reply. When I tried using TrtGraphConverterV2, I got this error in Int8 mode:

E tensorflow/core/common_runtime/executor.cc:654] Executor failed to create kernel. Not found: No registered 'CombinedNonMaxSuppression' OpKernel for 'GPU' devices compatible with node {{node StatefulPartitionedCall/yolov3/yolo_nms/combined_non_max_suppression/CombinedNonMaxSuppression}}
	.  Registered:  device='CPU'

Could this be an indication that the conversion failed because no Int8 kernel is available for CombinedNonMaxSuppression? Shouldn't TRT handle it by falling back to float32 for this operation or the entire model?


I think the log says TF is trying to run combined_nms instead of TRT, which means TF-TRT couldn't convert that op to TRT for some reason. Hopefully the conversion log explains why that particular conversion failed (you might need to increase the verbosity of the log). I have tested the conversion for SSD but not for YOLO; it's possible that the op attributes have unsupported values.

TF fails to run combined_nms because that op doesn't have a GPU implementation in TF. You might be able to manually set the device of that op to CPU to get around the problem; a sketch follows.
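
A minimal sketch of pinning the op to the CPU by rewriting a frozen GraphDef; this assumes a TF1-style frozen graph (a TF2 SavedModel would need its function graphs edited instead), and the path is a placeholder:

import tensorflow as tf

# Hypothetical path to a frozen graph -- substitute your own.
GRAPH_PATH = "/path/to/frozen_graph.pb"

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile(GRAPH_PATH, "rb") as f:
    graph_def.ParseFromString(f.read())

# Pin every CombinedNonMaxSuppression node to the CPU so TF never tries
# to place it on the GPU, where no kernel is registered.
for node in graph_def.node:
    if node.op == "CombinedNonMaxSuppression":
        node.device = "/device:CPU:0"

with tf.io.gfile.GFile(GRAPH_PATH + ".cpu_nms", "wb") as f:
    f.write(graph_def.SerializeToString())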

It has something to do with the model input type: float_image_tensor vs image_tensor.

SSD MobileNet v2 exported with combined NMS and input_type=float_image_tensor (float32, batch size None) works fine with TRT, but the same model exported with combined NMS and input_type=image_tensor (uint8, batch size 1) fails with TRT (see the export sketch after the log below):

2021-12-16 02:59:55.211762: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:587] Running native segment forPartitionedCall/TRTEngineOp_0_1 due to failure in verifying input shapes: Input shapes do not match input partial shapes stored in graph, for PartitionedCall/TRTEngineOp_0_1: [[1,576,19,19], [1,1280,10,10], [1,512,5,5], [1,256,2,2], [1,128,1,1], [1,256,3,3], [1,1917,4], [1], [1], [1], [1]] != [[1,576,19,19], [1,1280,10,10], [1,512,5,5], [1,256,2,2], [1,128,1,1], [1,256,3,3], [1,1917,4], [1,1], [1,1], [1,1], [1,1]]
2021-12-16 02:59:55.224076: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at trt_engine_op.cc:400 : Not found: No registered 'CombinedNonMaxSuppression' OpKernel for 'GPU' devices compatible with node {{node StatefulPartitionedCall/Postprocessor/CombinedNonMaxSuppression/combined_non_max_suppression/CombinedNonMaxSuppression}}
	.  Registered:  device='CPU'

	 [[StatefulPartitionedCall/Postprocessor/CombinedNonMaxSuppression/combined_non_max_suppression/CombinedNonMaxSuppression]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1679, in _call_impl
    cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1756, in _call_with_structured_signature
    self._structured_signature_check_missing_args(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1777, in _structured_signature_check_missing_args
    ", ".join(sorted(missing_arguments))))
TypeError: signature_wrapper(*, input_tensor) missing required arguments: input_tensor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1669, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1683, in _call_impl
    cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1736, in _call_with_flat_signature
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py", line 116, in _call_flat
    cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError:  No registered 'CombinedNonMaxSuppression' OpKernel for 'GPU' devices compatible with node {{node StatefulPartitionedCall/Postprocessor/CombinedNonMaxSuppression/combined_non_max_suppression/CombinedNonMaxSuppression}}
	.  Registered:  device='CPU'

	 [[StatefulPartitionedCall/Postprocessor/CombinedNonMaxSuppression/combined_non_max_suppression/CombinedNonMaxSuppression]]
	 [[PartitionedCall/TRTEngineOp_0_1]] [Op:__inference_signature_wrapper_133833]

Function call stack:
signature_wrapper
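
For reference, a sketch of re-exporting with the working input type via the Object Detection API exporter; the paths are placeholders:

python exporter_main_v2.py \
    --input_type float_image_tensor \
    --pipeline_config_path /path/to/pipeline.config \
    --trained_checkpoint_dir /path/to/checkpoint \
    --output_directory /path/to/exported_model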

TRT throws the same problem even after changing the model's input type. Has anyone found a solution to this?

I am receiving a warning at conversion time: 'additional_fields is not supported by combined_nms.' Not sure if it has to do with this or not.