BERT Large (TF2) model conversion fails
tfeher opened this issue · 5 comments
The following code loads the BERT Large model from TF Hub and tries to convert it with TF-TRT.
import os
os.environ["TF_CPP_VMODULE"]="trt_engine_utils=2,trt_engine_op=2,convert_nodes=2,convert_graph=2,segment=2,trt_shape_optimization_profiles=2,trt_engine_resource_ops=2"
import tensorflow as tf
import tensorflow_hub as hub
# ## Download and save the model
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4'
bert_saved_model_path = 'bert_large'
bert_model = hub.load(tfhub_handle_encoder)
tf.saved_model.save(bert_model, bert_saved_model_path)
# ## Convert with TF-TRT
import numpy as np
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.compiler.tensorrt import trt_convert as trt
# ### 2.1 Helper functions
def get_func_from_saved_model(saved_model_dir):
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded
def trt_convert(input_path, output_path, input_shapes, explicit_batch=False,
                dtype=np.float32, precision='FP32'):
    conv_params = trt.TrtConversionParams(
        precision_mode=precision, minimum_segment_size=50,
        max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=input_path, conversion_params=conv_params,
        use_dynamic_shape=explicit_batch,
        dynamic_shape_profile_strategy="Optimal")
    converter.convert()

    def input_fn():
        for shapes in input_shapes:
            # return a list of input tensors
            yield [np.ones(shape=x).astype(dtype) for x in shapes]

    converter.build(input_fn)
    converter.save(output_path)
# ### 2.2 Convert the model with TF-TRT
bert_trt_path = bert_saved_model_path + '_trt'
input_shapes = [[(1, 128), (1, 128), (1, 128)]]
trt_convert(bert_saved_model_path, bert_trt_path, input_shapes, True, np.int32, precision='FP16')
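As a sanity check on those shapes, the serving signature of the saved encoder can be inspected before conversion; a minimal sketch, reusing get_func_from_saved_model from above (the exact input names and static shapes depend on the TF Hub export):

func, _ = get_func_from_saved_model(bert_saved_model_path)
# The three (1, 128) int32 arrays yielded by input_fn have to line up with these inputs.
for tensor in func.inputs:
    print(tensor.name, tensor.shape, tensor.dtype)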
The conversion fails because the converted graph exceeds the 2 GiB protobuf size limit. The following error is printed to the terminal:
libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/message_lite.cc:451] tensorflow.GraphDef exceeded maximum protobuf size of 2GB: 3891330361
Problems
1. Error message
When executing the script from a Jupyter notebook, we see only the following message, which is not helpful. TF-TRT should provide a better error message.
KeyError Traceback (most recent call last)
<ipython-input-13-22441554dd2a> in <module>
1 bert_trt_path = bert_saved_model_path + '_trt'
2 input_shapes = [[(1, 128), (1, 128), (1, 128)]]
----> 3 trt_convert(bert_saved_model_path, bert_trt_path, input_shapes, True, np.int32, precision='FP16')
<ipython-input-11-e9fba559b75c> in trt_convert(input_path, output_path, input_shapes, explicit_batch, dtype, precision)
9 dynamic_shape_profile_strategy="Optimal")
10
---> 11 converter.convert()
12
13 def input_fn():
/usr/local/lib/python3.8/dist-packages/tensorflow/python/compiler/tensorrt/trt_convert.py in convert(self, calibration_input_fn)
1108 # Run TRT optimizer in Grappler to convert the graph.
1109 self._converted_graph_def = self._run_conversion(grappler_meta_graph_def)
-> 1110 self._converted_func = wrap_function.function_from_graph_def(
1111 self._converted_graph_def,
1112 [tensor.name for tensor in frozen_func.inputs],
/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/wrap_function.py in function_from_graph_def(graph_def, inputs, outputs)
651 import_graph = wrapped_import.graph
652 return wrapped_import.prune(
--> 653 nest.map_structure(import_graph.as_graph_element, inputs),
654 nest.map_structure(import_graph.as_graph_element, outputs))
/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
865
866 return pack_sequence_as(
--> 867 structure[0], [func(*x) for x in entries],
868 expand_composites=expand_composites)
869
/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
865
866 return pack_sequence_as(
--> 867 structure[0], [func(*x) for x in entries],
868 expand_composites=expand_composites)
869
/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py in as_graph_element(self, obj, allow_tensor, allow_operation)
3753
3754 with self._lock:
-> 3755 return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
3756
3757 def _as_graph_element_locked(self, obj, allow_tensor, allow_operation):
/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py in _as_graph_element_locked(self, obj, allow_tensor, allow_operation)
3793 op = self._nodes_by_name[op_name]
3794 else:
-> 3795 raise KeyError("The name %s refers to a Tensor which does not "
3796 "exist. The operation, %s, does not exist in the "
3797 "graph." % (repr(name), repr(op_name)))
2. TF-TRT almost triples the model size
func.graph.as_graph_def().ByteSize(): 0.8 MiB
frozen_func.graph.as_graph_def().ByteSize(): 1.25 GiB
converted_func: 3.62 GiB
The frozen graph size of 1.25 GiB is the expected size of a BERT large model. The size of the converted func is unexpectedly large.
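For reference, the first two numbers can be read off the GraphDef protos directly (a sketch using the func and frozen_func names from the debug snippet below); the 3.62 GiB matches the 3891330361 bytes reported in the libprotobuf error above:

GiB = float(1 << 30)
# ByteSize() returns the serialized size without actually serializing the proto.
print(func.graph.as_graph_def().ByteSize() / GiB)         # ~0.0008 GiB, variables not yet inlined
print(frozen_func.graph.as_graph_def().ByteSize() / GiB)  # ~1.25 GiB
print(3891330361 / GiB)                                    # ~3.62 GiB for the converted graph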
3. Protobuf size limit
There are DL models whose size is larger than 2 GiB; for those, TF-TRT conversion hits the protobuf size limit already at the step where the frozen func is created.
Tagging @bixia1 and @DEKHTIARJonathan.
To debug the problem, I have copied the code from trt_convert.py here:
func, model = get_func_from_saved_model(bert_saved_model_path)
# Create frozen func
from tensorflow.python.framework import convert_to_constants
frozen_func = convert_to_constants.convert_variables_to_constants_v2(func)
# Prepare for Grappler optimization pass
from tensorflow.python.training import saver
grappler_meta_graph_def = saver.export_meta_graph(
    graph_def=frozen_func.graph.as_graph_def(), graph=frozen_func.graph)
from tensorflow.core.protobuf import config_pb2
from tensorflow.core.protobuf import meta_graph_pb2
from tensorflow.core.protobuf import rewriter_config_pb2
fetch_collection = meta_graph_pb2.CollectionDef()
for array in frozen_func.inputs + frozen_func.outputs:
    fetch_collection.node_list.value.append(array.name)
grappler_meta_graph_def.collection_def["train_op"].CopyFrom(fetch_collection)
grappler_session_config = config_pb2.ConfigProto()
conv_params = trt.TrtConversionParams(
    precision_mode='FP16', minimum_segment_size=50,
    max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
custom_rewriter_config = trt._get_tensorrt_rewriter_config(
    conversion_params=conv_params,
    is_dynamic_op=True,
    max_batch_size=None,
    disable_non_trt_optimizers=False,
    use_implicit_batch=False,
    profile_strategy="Optimal")
grappler_session_config.graph_options.rewrite_options.CopyFrom(
    custom_rewriter_config)
# Convert
from tensorflow.python.grappler import tf_optimizer
converted_graph_def = tf_optimizer.OptimizeGraph(
    grappler_session_config, grappler_meta_graph_def, graph_id=b"tf_graph")
This last step returns an empty graph def; we should throw an error in that case to avoid the misleading error described in Problem 1 above.
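A minimal guard along these lines (a sketch, not what trt_convert.py does today) would already turn the opaque KeyError into an actionable message:

if not converted_graph_def.node:
    raise RuntimeError(
        "Grappler returned an empty GraphDef after the TF-TRT pass; the optimized "
        "graph most likely exceeded the 2 GiB protobuf limit (see the libprotobuf "
        "error in the logs).")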
@bixia1 While the conversion of the TF Hub BERT Large model fails, there are other versions of BERT Large that can be converted with TF-TRT, including the NGC BERT models and the HuggingFace BERT Large models. Here is a script that demonstrates HuggingFace BERT model conversion. You need to run pip install transformers (and pip install ipywidgets if you are using a Jupyter notebook).
import tensorflow as tf
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
tf.get_logger().setLevel('ERROR')
# ## Helper functions
def get_func_from_saved_model(saved_model_dir):
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded
def trt_convert(input_path, output_path, input_shapes, explicit_batch=False,
                dtype=np.float32, API='new', precision='FP32'):
    conv_params = trt.TrtConversionParams(
        precision_mode=precision, minimum_segment_size=50,
        max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=input_path, conversion_params=conv_params,
        use_dynamic_shape=explicit_batch,
        dynamic_shape_profile_strategy="Optimal")
    converter.convert()

    def input_fn():
        for shapes in input_shapes:
            # return a list of input tensors
            yield [np.ones(shape=x).astype(dtype) for x in shapes]

    converter.build(input_fn)
    converter.save(output_path)
# ## Get Huggingface BERT model
from transformers import TFBertModel
# Creation of a subclass in order to define a new serving signature.
# Define input size (set any of these to None for dynamic input size).
# Note TF-TRT fails to convert with dynamic input size.
batch_size = 1
seq_length = 128
class MyOwnModel(TFBertModel):
    # Decorate the serving method with the new input_signature.
    # An input_signature specifies the name, data type and shape of an expected input.
    @tf.function(input_signature=[{
        "input_ids": tf.TensorSpec((batch_size, seq_length), tf.int32, name="input_ids"),
        "attention_mask": tf.TensorSpec((batch_size, seq_length), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((batch_size, seq_length), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        # call the model to process the inputs
        output = self.call(inputs)
        # return the formatted output
        return self.serving_output(output)
# Instantiate the model with the new serving method
model = MyOwnModel.from_pretrained("bert-large-uncased")
# save it with saved_model=True in order to have a SavedModel version along with the h5 weights.
model.save_pretrained("my_hf_bert_large_model_static_shape", saved_model=True)
bert_saved_model_path = 'my_hf_bert_large_model_static_shape/saved_model/1'
# ## Convert the model with TF-TRT
bert_trt_path = bert_saved_model_path + '_trt'
input_shapes = [[(1, 128), (1, 128), (1, 128)]]
trt_convert(bert_saved_model_path, bert_trt_path, input_shapes, True, np.int32, precision='FP16')
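Once this variant converts, the resulting TRT SavedModel can be smoke-tested with dummy inputs; a minimal sketch reusing get_func_from_saved_model from above (the output keys depend on the transformers version):

trt_func, _ = get_func_from_saved_model(bert_trt_path)
dummy = {
    "input_ids": tf.ones((1, 128), dtype=tf.int32),
    "attention_mask": tf.ones((1, 128), dtype=tf.int32),
    "token_type_ids": tf.zeros((1, 128), dtype=tf.int32),
}
outputs = trt_func(**dummy)
print({name: tensor.shape for name, tensor in outputs.items()})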
CC: @pkanwar23 @sanjoy @WhiteFangBuck
We talked about it this Wednesday ;)
- TF-TRT almost triples the model size
I have investigated why the model size triples and found two points during the conversion where duplication happens.
Constant folding pass
First, there is a nearly 2x duplication of constants in the first constant-folding pass. This is because each of 391 Const nodes is, directly or indirectly, the input of two distinct Identity nodes.
E.g. these nodes:
{{node unknown_42}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
{{node Func/StatefulPartitionedCall/input/_46}} = Identity[T=DT_FLOAT](unknown_42);
{{node Func/StatefulPartitionedCall/StatefulPartitionedCall/input/_469}} = Identity[T=DT_FLOAT](Func/StatefulPartitionedCall/input/_46, ^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Identity[T=DT_FLOAT](Func/StatefulPartitionedCall/StatefulPartitionedCall/input/_469);
result in these two constants:
{{node Func/StatefulPartitionedCall/input/_46}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>](^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
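The blow-up can be quantified by fingerprinting the Const nodes of the graph dumped after the constant-folding pass; a rough sketch (graph_def_after_constfold is a hypothetical name for that dump):

from collections import Counter

def count_duplicated_consts(graph_def):
    # Group Const nodes by the serialized content of their tensor value.
    fingerprints = Counter(
        node.attr["value"].tensor.SerializeToString()
        for node in graph_def.node if node.op == "Const")
    # Number of distinct constant values that appear in more than one Const node.
    return sum(1 for count in fingerprints.values() if count > 1)

# e.g. count_duplicated_consts(graph_def_after_constfold)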
TRT conversion
There are 386 Const nodes that appear both in the TRT segment and in the graph after the TRT conversion pass. I think they actually correspond to almost all of the 391 duplicated constants mentioned previously; e.g. I see this Const in both graphs:
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>](^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
It's duplicated because it is an input of nodes in both graphs (3 nodes in the base graph and 2 in the TRT segment).
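To confirm which constants end up on both sides, one can compare the Const node names in the outer graph with those inside the functions of the graph's function library, where the TRT segments live; a sketch (matching by node name, as the dumps above suggest; matching on tensor content would be more robust):

def consts_in_graph_and_segment(graph_def):
    # Const nodes that survive in the outer (converted) graph.
    outer = {node.name for node in graph_def.node if node.op == "Const"}
    overlapping = set()
    # TRT segments (and any remaining functions) are stored in the function library.
    for func_def in graph_def.library.function:
        for node in func_def.node_def:
            if node.op == "Const" and node.name in outer:
                overlapping.add(node.name)
    return overlapping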
Quick update on this: adding a "dependency" pass before "constfold" solves problem 1 and the graph becomes small enough to convert successfully (problem 2 remains).
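For anyone who wants to reproduce that workaround on the debug snippet above, the pass order can be adjusted on the RewriterConfig before calling OptimizeGraph; a sketch (assuming the config built by _get_tensorrt_rewriter_config populates the explicit optimizers list):

# custom_rewriter_config is the RewriterConfig from the debug snippet above.
# Running the "dependency" optimizer first prunes the redundant Identity chains,
# so the subsequent "constfold" pass no longer duplicates the constants behind them.
optimizers = list(custom_rewriter_config.optimizers)
if "constfold" in optimizers and "dependency" not in optimizers:
    custom_rewriter_config.optimizers.insert(optimizers.index("constfold"), "dependency")
grappler_session_config.graph_options.rewrite_options.CopyFrom(custom_rewriter_config)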