tensorflow/tensorrt

TF-TRT is slower than native TF with 512×512 images

lxl910915 opened this issue · 5 comments

We tested the EAST model with TF-TRT and native TF on a V100 GPU.
At 256×256, one inference takes 3 ms with TF-TRT and 4 ms with native TF.
At 512×512, one inference takes 5 ms with TF-TRT and 4.5 ms with native TF.
At 1024×1024, one inference takes 40 ms with TF-TRT and 15 ms with native TF.

TF and TRT versions:

  tensorrt (5.1.5.0)
  tf-nightly-gpu (1.14.1.dev20190301)

TF-TRT is slow for large images. Is this normal? Thanks.

Please try a more recent stack for TRT and TF.
TRT 7 and TF 1.15, or better yet TF 2, would be easier to debug.

We rebuilt with TRT 7 and TF 1.15. TF-TRT is still slower than native TF.
The following log shows that 47 ops are not supported by TensorRT:
There are 47 ops of 9 different types in the graph that are not converted to TensorRT: Sigmoid, Placeholder, ConcatV2, NoOp, FusedBatchNorm, Relu, MaxPool, BatchToSpaceND, SpaceToBatchND, (For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops).

For example, ConcatV2 has been supported since TF 1.12; refer
to https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tf-112.
Why is ConcatV2 not converted with TF 1.15? Is TF-TRT backward compatible?

For our EAST model, TF-TRT generates 30 TRTEngineOps, and most of them contain only 6 nodes. Maybe this is the root cause.

2020-02-28 07:12:47.057549: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:460] There are 47 ops of 9 different types in the graph that are not converted to TensorRT: Sigmoid, Placeholder, ConcatV2, NoOp, FusedBatchNorm, Relu, MaxPool, BatchToSpaceND, SpaceToBatchND, (For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops).
2020-02-28 07:12:47.065898: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:633] Number of TensorRT candidate segments: 30
2020-02-28 07:12:47.157808: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 23 nodes succeeded.
2020-02-28 07:12:47.157983: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node feature_fusion/TRTEngineOp_1 added for segment 1 consisting of 104 nodes succeeded.
2020-02-28 07:12:47.158269: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/conv2/TRTEngineOp_2 added for segment 2 consisting of 11 nodes succeeded.
2020-02-28 07:12:47.158369: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/TRTEngineOp_3 added for segment 3 consisting of 6 nodes succeeded.
2020-02-28 07:12:47.158425: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/shortcut/TRTEngineOp_4 added for segment 4 consisting of 6 nodes succeeded.
...
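The fragmentation is easy to quantify directly from the conversion log. A throwaway sketch; the sample lines are abbreviated from the log above:

```python
import re

# Count TRT engine segment sizes from TF-TRT's convert_graph.cc log lines.
# Sample lines abbreviated from the log in this issue.
LOG = """\
TensorRT node TRTEngineOp_0 added for segment 0 consisting of 23 nodes succeeded.
TensorRT node feature_fusion/TRTEngineOp_1 added for segment 1 consisting of 104 nodes succeeded.
TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/TRTEngineOp_3 added for segment 3 consisting of 6 nodes succeeded.
"""

sizes = [int(n) for n in re.findall(r"consisting of (\d+) nodes", LOG)]
print(sorted(sizes))                                # → [6, 23, 104]
print(sum(s < 10 for s in sizes), "tiny engines")   # engines below 10 nodes
```

Run against the full log, this shows how many of the 30 engines are small enough that the data movement into and out of each engine can outweigh the TensorRT speedup.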

tensorflow/tensorflow#22682 (comment)

Could you give me a reproducible use case: a Python script plus the shell command?
I'll try to give it a look.

Thank you!
We have uploaded our saved_model.pb.

Software and hardware version:
TRT 7.0.0.11
TF 1.15.0
Tesla V100-SXM2-16GB GPU

Untar it to /tmp/SavedModel-1024-1024.
To run the native SavedModel: 'CUDA_VISIBLE_DEVICES=0 python east_sm.py'. One inference costs 25 ms.
To run TF-TRT: 'CUDA_VISIBLE_DEVICES=0 python east_tftrt.py'. One inference costs 40 ms.
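A methodology note: with `is_dynamic_op=True`, TF-TRT builds its TensorRT engines on the first execution, so warm-up runs should be excluded from any latency number. A minimal pure-Python harness sketch; `run_inference` is a stand-in for the actual `sess.run` call:

```python
import time

def time_inference(run_inference, warmup=10, iters=100):
    """Average latency in ms, excluding warm-up runs.

    Warm-up matters for TF-TRT: with dynamic ops, TensorRT engines are
    built on the first execution, so the first few runs are much slower.
    """
    for _ in range(warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    return (time.perf_counter() - start) * 1000.0 / iters

# Usage (hypothetical): time_inference(lambda: sess.run(outputs, feed_dict=feed))
```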

Then we profiled the TF-TRT run with nvprof (CUDA_VISIBLE_DEVICES=0 nvprof python east_tftrt.py) and found that genericReformat::copyPackedKernel occupies about 60% of GPU time.

==33264== NVTX result:
==33264== Warning: Found 198911 invalid range marker(s)
==33264==   Thread "<unnamed>" (id = 268433152)
==33264==     Domain "TensorRT"
==33264==       Range "<unnamed>"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  1.57525s     72779  21.644us  7.3260us  850.74us  <unnamed>
 GPU activities:   58.34%  4.46830s     33826  132.10us  2.3030us  2.0179ms  void genericReformat::copyPackedKernel<float, float, bool=0, bool=1, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)

The detailed log is here.