tensorflow/tensorrt

BERT Large (TF2) model conversion fails

tfeher opened this issue · 5 comments

The following code loads the BERT Large model from TF Hub and tries to convert it using TF-TRT.

import os
os.environ["TF_CPP_VMODULE"]="trt_engine_utils=2,trt_engine_op=2,convert_nodes=2,convert_graph=2,segment=2,trt_shape_optimization_profiles=2,trt_engine_resource_ops=2"

import tensorflow as tf
import tensorflow_hub as hub

# ## Download and save the model

tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4'
bert_saved_model_path = 'bert_large'
bert_model = hub.load(tfhub_handle_encoder)
tf.saved_model.save(bert_model, bert_saved_model_path)

# ## Convert with TF-TRT

import numpy as np
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# ### 2.1 Helper functions
def get_func_from_saved_model(saved_model_dir):
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded

def trt_convert(input_path, output_path, input_shapes, explicit_batch=False,
                dtype=np.float32, precision='FP32'):
    conv_params=trt.TrtConversionParams(
        precision_mode=precision, minimum_segment_size=50,
        max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=input_path, conversion_params=conv_params,
        use_dynamic_shape=explicit_batch,
        dynamic_shape_profile_strategy="Optimal")

    converter.convert()

    def input_fn():
        for shapes in input_shapes:
            # return a list of input tensors
            yield [np.ones(shape=x).astype(dtype) for x in shapes]

    converter.build(input_fn)
    converter.save(output_path)

# ### 2.2 Convert the model with TF-TRT

bert_trt_path = bert_saved_model_path + '_trt'
input_shapes = [[(1, 128), (1, 128), (1, 128)]]
trt_convert(bert_saved_model_path, bert_trt_path, input_shapes, True, np.int32, precision='FP16')

The conversion fails because the converted GraphDef exceeds the 2 GiB protobuf serialization limit. The following error is printed to the terminal:

libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/message_lite.cc:451] tensorflow.GraphDef exceeded maximum protobuf size of 2GB: 3891330361

Problems

1. Error message

When executing the script from a Jupyter notebook, we see only the following traceback, which does not point to the root cause. TF-TRT should provide a better error message.

KeyError                                  Traceback (most recent call last)
<ipython-input-13-22441554dd2a> in <module>
      1 bert_trt_path = bert_saved_model_path + '_trt'
      2 input_shapes = [[(1, 128), (1, 128), (1, 128)]]
----> 3 trt_convert(bert_saved_model_path, bert_trt_path, input_shapes, True, np.int32, precision='FP16')

<ipython-input-11-e9fba559b75c> in trt_convert(input_path, output_path, input_shapes, explicit_batch, dtype, precision)
      9         dynamic_shape_profile_strategy="Optimal")
     10 
---> 11     converter.convert()
     12 
     13     def input_fn():

/usr/local/lib/python3.8/dist-packages/tensorflow/python/compiler/tensorrt/trt_convert.py in convert(self, calibration_input_fn)
   1108     # Run TRT optimizer in Grappler to convert the graph.
   1109     self._converted_graph_def = self._run_conversion(grappler_meta_graph_def)
-> 1110     self._converted_func = wrap_function.function_from_graph_def(
   1111         self._converted_graph_def,
   1112         [tensor.name for tensor in frozen_func.inputs],

/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/wrap_function.py in function_from_graph_def(graph_def, inputs, outputs)
    651   import_graph = wrapped_import.graph
    652   return wrapped_import.prune(
--> 653       nest.map_structure(import_graph.as_graph_element, inputs),
    654       nest.map_structure(import_graph.as_graph_element, outputs))

/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
    865 
    866   return pack_sequence_as(
--> 867       structure[0], [func(*x) for x in entries],
    868       expand_composites=expand_composites)
    869 

/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
    865 
    866   return pack_sequence_as(
--> 867       structure[0], [func(*x) for x in entries],
    868       expand_composites=expand_composites)
    869 

/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py in as_graph_element(self, obj, allow_tensor, allow_operation)
   3753 
   3754     with self._lock:
-> 3755       return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
   3756 
   3757   def _as_graph_element_locked(self, obj, allow_tensor, allow_operation):

/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py in _as_graph_element_locked(self, obj, allow_tensor, allow_operation)
   3793           op = self._nodes_by_name[op_name]
   3794         else:
-> 3795           raise KeyError("The name %s refers to a Tensor which does not "
   3796                          "exist. The operation, %s, does not exist in the "
   3797                          "graph." % (repr(name), repr(op_name)))

2. TF-TRT almost triples the model size

       func.graph.as_graph_def().ByteSize(): 0.8 MiB
frozen_func.graph.as_graph_def().ByteSize(): 1.25 GiB
                            converted_func:  3.62 GiB

The frozen graph size of 1.25 GiB matches the expected size of a BERT Large model (roughly 340M parameters × 4 bytes per FP32 weight ≈ 1.3 GiB). The size of the converted func is unexpectedly large.
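For reference, the first two numbers can be reproduced by calling ByteSize() on the corresponding GraphDefs (func and frozen_func are created in the debug snippet further below); the converted size is the 3891330361 bytes reported by the libprotobuf error:

print(func.graph.as_graph_def().ByteSize() / 2**20, "MiB")         # original func
print(frozen_func.graph.as_graph_def().ByteSize() / 2**30, "GiB")  # after freezing variables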

3. Protobuf size limit

There are DL models whose size is larger than 2 GiB. For those, TF-TRT conversion will hit the protobuf size limit already at the step where the frozen func is created.
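As a rough pre-flight check, one can sum the variable sizes of the SavedModel before attempting conversion. This is only a sketch: it assumes the loaded object exposes a variables collection, which holds for Keras models and the TF Hub BERT module used above.

import numpy as np

func, model = get_func_from_saved_model(bert_saved_model_path)
# A frozen GraphDef embeds every weight as a Const node, so freezing cannot
# succeed if the total weight size already approaches the 2 GiB limit.
total_bytes = sum(int(np.prod(v.shape)) * v.dtype.size for v in model.variables)
print('variables: %.2f GiB (protobuf limit: 2 GiB)' % (total_bytes / 2**30))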

Tagging @bixia1 and @DEKHTIARJonathan.

To debug the problem, I have copied the code from trt_convert.py here:

func, model = get_func_from_saved_model(bert_saved_model_path)

# Create frozen func
from tensorflow.python.framework import convert_to_constants
frozen_func = convert_to_constants.convert_variables_to_constants_v2(func)

# Prepare for Grappler optimization pass
from tensorflow.python.training import saver
grappler_meta_graph_def = saver.export_meta_graph(
        graph_def=frozen_func.graph.as_graph_def(), graph=frozen_func.graph)

from tensorflow.core.protobuf import config_pb2
from tensorflow.core.protobuf import meta_graph_pb2
from tensorflow.core.protobuf import rewriter_config_pb2
fetch_collection = meta_graph_pb2.CollectionDef()
for array in frozen_func.inputs + frozen_func.outputs:
    fetch_collection.node_list.value.append(array.name)
grappler_meta_graph_def.collection_def["train_op"].CopyFrom(fetch_collection)

grappler_session_config = config_pb2.ConfigProto()
conv_params=trt.TrtConversionParams(
        precision_mode='FP16', minimum_segment_size=50,
        max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
custom_rewriter_config = trt._get_tensorrt_rewriter_config(
        conversion_params=conv_params,
        is_dynamic_op=True,
        max_batch_size=None,
        disable_non_trt_optimizers=False,
        use_implicit_batch=False,
        profile_strategy="Optimal")
grappler_session_config.graph_options.rewrite_options.CopyFrom(
        custom_rewriter_config)

# Convert

from tensorflow.python.grappler import tf_optimizer
converted_graph_def = tf_optimizer.OptimizeGraph(grappler_session_config, grappler_meta_graph_def, graph_id=b"tf_graph")

This last step returns an empty graph def; we should raise an error in that case to avoid the misleading KeyError cited in Problem 1 above.
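A minimal sketch of such a check, placed right after the OptimizeGraph call:

# Guard against the silent failure: an empty result means Grappler/TF-TRT
# could not produce (or serialize) the converted graph.
if not converted_graph_def.node:
    raise RuntimeError(
        "TF-TRT conversion returned an empty GraphDef; the converted graph "
        "most likely exceeded the 2 GiB protobuf serialization limit.")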

@bixia1 While the conversion of the TF Hub BERT Large model fails, there are other versions of BERT Large that can be converted with TF-TRT, including the NGC BERT models and the HuggingFace BERT Large models. Here is a script that demonstrates HuggingFace BERT model conversion. You need to run pip install transformers (and pip install ipywidgets if you are using a Jupyter notebook).

import tensorflow as tf
import numpy as np

from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
tf.get_logger().setLevel('ERROR')

# ## Helper functions
# In[3]:

def get_func_from_saved_model(saved_model_dir):
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded

def trt_convert(input_path, output_path, input_shapes, explicit_batch=False,
                dtype=np.float32, API='new', precision='FP32'):
    conv_params=trt.TrtConversionParams(
        precision_mode=precision, minimum_segment_size=50,
        max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=input_path, conversion_params=conv_params,
        use_dynamic_shape=explicit_batch,
        dynamic_shape_profile_strategy="Optimal")

    converter.convert()

    def input_fn():
        for shapes in input_shapes:
            # return a list of input tensors
            yield [np.ones(shape=x).astype(dtype) for x in shapes]

    converter.build(input_fn)
    converter.save(output_path)

# ## Get Huggingface BERT model
from transformers import TFBertModel

# Creation of a subclass in order to define a new serving signature.

# Define input size (set any of these to None for dynamic input size).
# Note TF-TRT fails to convert with dynamic input size.
batch_size = 1
seq_length = 128

class MyOwnModel(TFBertModel):
    # Decorate the serving method with the new input_signature
    # an input_signature represents the name, the data type and the shape of an expected input
    @tf.function(input_signature=[{
        "input_ids": tf.TensorSpec((batch_size, seq_length), tf.int32, name="input_ids"),
        "attention_mask": tf.TensorSpec((batch_size, seq_length), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((batch_size, seq_length), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        # call the model to process the inputs
        output = self.call(inputs)

        # return the formatted output
        return self.serving_output(output)

# Instantiate the model with the new serving method
model = MyOwnModel.from_pretrained("bert-large-uncased")

# save it with saved_model=True in order to have a SavedModel version along with the h5 weights.
model.save_pretrained("my_hf_bert_large_model_static_shape", saved_model=True)

bert_saved_model_path = 'my_hf_bert_large_model_static_shape/saved_model/1'

# ## Convert the model with TF-TRT

bert_trt_path = bert_saved_model_path + '_trt'
input_shapes = [[(1, 128), (1, 128), (1, 128)]] 
trt_convert(bert_saved_model_path, bert_trt_path, input_shapes, True, np.int32, precision='FP16')

CC: @pkanwar23 @sanjoy @WhiteFangBuck
We talked about it this Wednesday ;)

Nyrio commented
    2. TF-TRT almost triples the model size

I have investigated why the model size triples and found two points during the conversion where duplication happens.

Constant folding pass

First, there is a nearly 2x duplication of constants during the first constant folding pass. It is caused by 391 Const nodes, each of which is, directly or indirectly, the input of two distinct Identity nodes.

E.g., these nodes:

{{node unknown_42}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
{{node Func/StatefulPartitionedCall/input/_46}} = Identity[T=DT_FLOAT](unknown_42);
{{node Func/StatefulPartitionedCall/StatefulPartitionedCall/input/_469}} = Identity[T=DT_FLOAT](Func/StatefulPartitionedCall/input/_46, ^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Identity[T=DT_FLOAT](Func/StatefulPartitionedCall/StatefulPartitionedCall/input/_469);

result in these two constants:

{{node Func/StatefulPartitionedCall/input/_46}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>](^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);

TRT conversion

There are 386 Const nodes that appear both in the TRT segment and in the graph after the TRT conversion pass. I think they correspond to almost all of the 391 constants duplicated by the Identity folding mentioned previously; e.g., I see this Const in both graphs:

{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>](^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();

It's duplicated because it is an input of nodes in both graphs (3 nodes in the base graph and 2 in the TRT segment).
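The duplication can be counted with a small script along these lines (a sketch: it assumes converted_graph_def is the output of the TF-TRT Grappler pass, with the TRT segments stored as FunctionDefs in its function library, and matches nodes by name):

def count_shared_consts(graph_def):
    # Const nodes kept in the outer (converted) graph.
    outer_consts = {n.name for n in graph_def.node if n.op == "Const"}
    # Const nodes that also appear inside the TRT segment functions.
    shared = []
    for func in graph_def.library.function:
        for n in func.node_def:
            if n.op == "Const" and n.name in outer_consts:
                shared.append((func.signature.name, n.name))
    return shared

print(len(count_shared_consts(converted_graph_def)), "Const nodes appear in both graphs")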

Nyrio commented

Quick update on this: adding a "dependency" pass before "constfold" fixes Problem 1, and the graph becomes small enough to convert successfully (Problem 2, the size increase, remains).
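One way to try this experiment is to run Grappler with an explicit pass list before handing the graph to the TF-TRT pass. This is a sketch, not the shipped TF-TRT configuration: the pass names are the standard Grappler optimizer names, and grappler_meta_graph_def comes from the debug snippet earlier in this issue.

from tensorflow.core.protobuf import config_pb2
from tensorflow.core.protobuf import rewriter_config_pb2
from tensorflow.python.grappler import tf_optimizer

# Run "dependency" (which forwards/removes the redundant Identity chains)
# before "constfold", so constant folding no longer duplicates the weights.
pre_config = config_pb2.ConfigProto()
rewrite_opts = pre_config.graph_options.rewrite_options
rewrite_opts.optimizers.extend(["dependency", "constfold"])
rewrite_opts.meta_optimizer_iterations = rewriter_config_pb2.RewriterConfig.ONE

pre_folded_graph_def = tf_optimizer.OptimizeGraph(
    pre_config, grappler_meta_graph_def, graph_id=b"tf_graph")
# pre_folded_graph_def can then be wrapped into a new MetaGraphDef and passed
# through the TF-TRT rewriter configuration shown earlier.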