tensorflow/tensorrt

No performance improvement with TF-TRT optimization (ResNet50, DenseNet121)

srinivas-varadharajan opened this issue · 8 comments

System Information:

Container Image: nvcr.io/nvidia/tensorflow:19.10-py3
OS Platform and Distribution: Ubuntu 18.04
TensorFlow Version: tensorflow-gpu 1.14.0+nv
Python Version: 3.6.8
CUDA/cuDNN version: 10.1 / 7.6.4
GPU Model and Memory: T4 / 16 GB

Hey Guys,

I don't see any performance improvement when using the TF-TRT optimized graph for inference. I've tried both ResNet50 and DenseNet121. Inference on the validation dataset takes 161 seconds with the unoptimized TF frozen graph and 160 seconds with the optimized TF-TRT frozen graph. I'm using the TensorFlow NGC container, and the converted graphs are FP16. I'm not sure whether I'm doing the TF-TRT conversion incorrectly.

Code to convert and save the optimized TRT model:

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def get_frozen_graph(graph_file):
    """Read Frozen Graph file from disk."""
    with tf.gfile.FastGFile(graph_file, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    return graph_def


print("Load frozen graph from disk")

frozen_graph = get_frozen_graph("/workspace/saved_models/resnet_trained.pb")

#output_names = model.output.op.name

# Use the name of the last node in the graph as the output node
for node in frozen_graph.node:
    final_node_name = node.name

print("Optimize the model with TensorRT")

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=[final_node_name],
    max_batch_size=128,
    is_dynamic_op=True,
    precision_mode='FP16',
    minimum_segment_size=3
)

print("Write optimized model to the file")
with open("/workspace/saved_models/resnet_fp16_trt_test.pb", 'wb') as f:
    f.write(trt_graph.SerializeToString())

I've also tried to use TrtGraphConverter instead of create_inference_graph, but it just makes the inference time worse.

converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph,
    nodes_blacklist=[final_node_name],  # output nodes
    max_batch_size=128,
    is_dynamic_op=True,
    precision_mode="FP16")  # use dynamic mode if the graph has undefined shapes
trt_graph = converter.convert()

Part of log while converting the graph:

2019-11-08 19:53:57.289698: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:837] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 555 nodes succeeded.
2019-11-08 19:53:57.363554: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:752] Optimization results for grappler item: tf_graph
2019-11-08 19:53:57.363608: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:754] constant folding: Graph size after: 555 nodes (-320), 570 edges (-320), time = 400.515ms.
2019-11-08 19:53:57.363618: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:754] layout: Graph size after: 560 nodes (5), 572 edges (2), time = 101.55ms.
2019-11-08 19:53:57.363626: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:754] constant folding: Graph size after: 557 nodes (-3), 572 edges (0), time = 299.556ms.
2019-11-08 19:53:57.363635: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:754] TensorRTOptimizer: Graph size after: 3 nodes (-554), 2 edges (-570), time = 520.169ms.

Part of log while running inference:

No. of Nodes are 3

prefix/input_1
prefix/TRTEngineOp_0
prefix/dense/Sigmoid
predicting using your model:....
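
For reference, the inference side follows roughly this pattern (a minimal sketch; the dummy batch and warm-up run stand in for the actual validation pipeline, and the node names match the log above):

import time
import numpy as np
import tensorflow as tf

# Load the serialized TF-TRT frozen graph from disk
with tf.gfile.FastGFile("/workspace/saved_models/resnet_fp16_trt_test.pb", "rb") as f:
    trt_graph_def = tf.GraphDef()
    trt_graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    # Importing under name="prefix" produces the node names printed above,
    # e.g. prefix/input_1 and prefix/dense/Sigmoid
    tf.import_graph_def(trt_graph_def, name="prefix")

input_tensor = graph.get_tensor_by_name("prefix/input_1:0")
output_tensor = graph.get_tensor_by_name("prefix/dense/Sigmoid:0")

with tf.Session(graph=graph) as sess:
    batch = np.random.rand(128, 224, 224, 3).astype(np.float32)  # dummy batch
    sess.run(output_tensor, feed_dict={input_tensor: batch})      # warm-up (builds TRT engines when is_dynamic_op=True)
    start = time.time()
    sess.run(output_tensor, feed_dict={input_tensor: batch})
    print("Inference time: %.3f s" % (time.time() - start))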

Thanks.

Have you tried using TF2.0 and TensorRT 6.1?
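
For reference, in TF 2.x the conversion goes through a SavedModel and TrtGraphConverterV2 rather than a frozen graph; a minimal sketch (the SavedModel paths are placeholders):

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# FP16 conversion of a SavedModel with TF 2.x TF-TRT
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode="FP16",
    max_workspace_size_bytes=1 << 30)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="/workspace/saved_models/resnet_saved_model",  # placeholder path
    conversion_params=params)
converter.convert()
converter.save("/workspace/saved_models/resnet_fp16_trt_saved_model")   # placeholder path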

@srinivas-varadharajan ResNet50 and DenseNet121 have fixed input dimensions, so try is_dynamic_op=False and tweak max_workspace_size_bytes.
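
That is, something along these lines (a sketch of the suggested change; the workspace size is just an example value to tune):

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=[final_node_name],
    max_batch_size=128,
    max_workspace_size_bytes=1 << 30,  # example value (~1 GB), tune for the T4
    is_dynamic_op=False,               # ResNet50 / DenseNet121 have fixed input shapes
    precision_mode='FP16',
    minimum_segment_size=3
)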

Thank you. Let me try changing these parameters.

@srinivas-varadharajan, Did you succeed in improving ResNet50 performance w/ TF 2.x? I would like to try it myself. Your notes look very good (thanks for posting this). However, if you found no benefit, I won't bother.

@duffjay this person is using a TF-TRT version that is nearly 3 years old, any data / feedback would be irrelevant

@srinivas-varadharajan Were you able to make it work and see any improvement in the performance of the optimized model? I am trying tensorflow 2.10.0 and TensorRT 7.2.2.1.

I still don't see any improvement in the performance.

@DEKHTIARJonathan Any help is appreciated.

Just to share more about my progress in the last 12 months... I shifted to using NVIDIA DeepStream (v6.1). In the past, I started with a model and then tried to optimize it with TRT. If you go down the DeepStream route, you (should) start with known compatible models (and examples) that are already optimized in TRT and then do your transfer learning. In other words, you know it will work because you started from a known good place. I found DeepStream to be incredibly fast and capable (for computer vision tasks). You'll also see that many of the model examples are based on TensorFlow 1.15 (as opposed to 2.x), though I'm not sure why. To use DeepStream at full power, you really should know/learn GStreamer, and I found it easier to work with C/C++ (as opposed to Python). That's more complexity, but the configurable, proven functionality is vastly better than programming everything yourself in Python.