TF-TRT is slower than native TF with 512*512 images
lxl910915 opened this issue · 5 comments
We tested the EAST model with TF-TRT and native TF on a V100 GPU.
At 256*256, one inference takes 3 ms with TF-TRT and 4 ms with native TF.
At 512*512, one inference takes 5 ms with TF-TRT and 4.5 ms with native TF.
At 1024*1024, one inference takes 40 ms with TF-TRT and 15 ms with native TF.
TF and TRT versions:
tensorrt (5.1.5.0)
tf-nightly-gpu (1.14.1.dev20190301)
TF-TRT is slower for large images. Is this normal? Thanks.
Please try a more recent stack for TRT and TF.
TRT 7 and TF 1.15, or better yet TF 2, would be easier to debug.
We built TRT 7 and TF 1.15, but TF-TRT is still slower than native TF.
The following log shows that 47 ops are not supported by TensorRT:
There are 47 ops of 9 different types in the graph that are not converted to TensorRT: Sigmoid, Placeholder, ConcatV2, NoOp, FusedBatchNorm, Relu, MaxPool, BatchToSpaceND, SpaceToBatchND, (For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops).
For example, ConcatV2 has been supported since TF 1.12; see
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tf-112.
Why doesn't TF 1.15 support ConcatV2? Is TF-TRT backward compatible?
For our EAST model, TF-TRT generates 30 TRTEngineOps, and most of them contain only 6 nodes. Maybe this is the root cause.
2020-02-28 07:12:47.057549: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:460] There are 47 ops of 9 different types in the graph that are not converted to TensorRT: Sigmoid, Placeholder, ConcatV2, NoOp, FusedBatchNorm, Relu, MaxPool, BatchToSpaceND, SpaceToBatchND, (For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops).
2020-02-28 07:12:47.065898: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:633] Number of TensorRT candidate segments: 30
2020-02-28 07:12:47.157808: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 23 nodes succeeded.
2020-02-28 07:12:47.157983: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node feature_fusion/TRTEngineOp_1 added for segment 1 consisting of 104 nodes succeeded.
2020-02-28 07:12:47.158269: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/conv2/TRTEngineOp_2 added for segment 2 consisting of 11 nodes succeeded.
2020-02-28 07:12:47.158369: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/TRTEngineOp_3 added for segment 3 consisting of 6 nodes succeeded.
2020-02-28 07:12:47.158425: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/shortcut/TRTEngineOp_4 added for segment 4 consisting of 6 nodes succeeded.
...
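To quantify the fragmentation, the segment sizes can be pulled straight out of the conversion log. A small stand-alone sketch (the LOG string below is just an excerpt of the lines above, with timestamps stripped):

```python
import re

# Excerpt of the TF-TRT conversion log (timestamps and file prefixes stripped).
LOG = """\
TensorRT node TRTEngineOp_0 added for segment 0 consisting of 23 nodes succeeded.
TensorRT node feature_fusion/TRTEngineOp_1 added for segment 1 consisting of 104 nodes succeeded.
TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/conv2/TRTEngineOp_2 added for segment 2 consisting of 11 nodes succeeded.
TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/TRTEngineOp_3 added for segment 3 consisting of 6 nodes succeeded.
TensorRT node resnet_v1_50/block1/unit_1/bottleneck_v1/shortcut/TRTEngineOp_4 added for segment 4 consisting of 6 nodes succeeded.
"""

# Each "added for segment ... consisting of N nodes" line is one TRTEngineOp.
pattern = re.compile(r"TensorRT node (\S+) added for segment \d+ consisting of (\d+) nodes")
sizes = {m.group(1): int(m.group(2)) for m in pattern.finditer(LOG)}

print("engines:", len(sizes))                                     # → engines: 5
print("tiny engines (<= 6 nodes):",
      sum(1 for n in sizes.values() if n <= 6))                   # → tiny engines (<= 6 nodes): 2
```

Run against the full log, this gives the distribution of engine sizes; a long tail of 6-node engines would support the fragmentation hypothesis.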
Could you give me a reproducible use case (Python script + shell command)?
I'll try to take a look.
Thank you!
We have uploaded our saved_model.pb.
Software and hardware versions:
TRT 7.0.0.11
TF 1.15.0
Tesla V100-SXM2-16GB GPU
Untar it to /tmp/SavedModel-1024-1024.
To run the saved model: 'CUDA_VISIBLE_DEVICES=0 python east_sm.py'. One inference takes 25 ms.
To run TF-TRT: 'CUDA_VISIBLE_DEVICES=0 python east_tftrt.py'. One inference takes 40 ms.
Then we profiled TF-TRT with nvprof ('CUDA_VISIBLE_DEVICES=0 nvprof python east_tftrt.py')
and found that genericReformat::copyPackedKernel occupies about 60% of the GPU time.
==33264== NVTX result:
==33264== Warning: Found 198911 invalid range marker(s)
==33264== Thread "<unnamed>" (id = 268433152)
==33264== Domain "TensorRT"
==33264== Range "<unnamed>"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 1.57525s 72779 21.644us 7.3260us 850.74us <unnamed>
GPU activities: 58.34% 4.46830s 33826 132.10us 2.3030us 2.0179ms void genericReformat::copyPackedKernel<float, float, bool=0, bool=1, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
The detailed log is here.
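The copyPackedKernel time is consistent with the fragmentation above: each TRTEngineOp boundary can force a tensor-layout reformat between TF and TRT, so 30 small engines means many copies. If that is indeed the cause, one thing worth trying is raising minimum_segment_size in the TF 1.15 TrtGraphConverter so tiny segments stay in native TF instead of becoming separate engines. A sketch only, untested on this model; the paths are placeholders and the threshold needs tuning:

```python
# Sketch: re-convert with a larger minimum_segment_size so that small
# segments (e.g. the 6-node ones) are left in native TF instead of
# becoming separate TRT engines, each paying a reformat at its boundary.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="/tmp/SavedModel-1024-1024",
    precision_mode="FP32",
    minimum_segment_size=20,  # default is 3; tune for your graph
    is_dynamic_op=True,       # build engines at runtime for the actual shapes
)
converter.convert()
converter.save("/tmp/SavedModel-1024-1024-trt")
```

Comparing nvprof output before and after should show whether the reformat kernels shrink along with the engine count.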