Empty output on second inference call on exported saved model
louisquinn opened this issue · 1 comments
Hey everyone, and thanks @peteryuX for the great work!
I'm experiencing a weird issue where, after exporting the model to the saved_model format, the second inference call returns an empty output. The first inference call always works, though. I am seeing this with both TensorFlow Serving and regular inference.
Here's how to reproduce...
import os  # only needed for the commented-out save variants below

import tensorflow as tf
import cv2
import numpy as np
from modules.models import RetinaFaceModel
from modules.utils import set_memory_growth, load_yaml, draw_bbox_landm, pad_input_image, recover_pad_output

CONFIG_PATH = "<path-to>/configs/retinaface_res50.yaml"
CHECKPOINT_PATH = "<path-to>/retinaface-tf2/checkpoints/retinaface_res50"
OUTPUT_PATH = "<path-to>/retinaface-tf2/checkpoints/retinaface_res50_export"


def main():
    image = cv2.imread("an-image-path")
    image_infer = np.expand_dims(cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32), axis=0)

    config = load_yaml(CONFIG_PATH)
    model = RetinaFaceModel(config, training=False, iou_th=0.4, score_th=0.5)
    checkpoint = tf.train.Checkpoint(model=model)
    checkpoint.restore(tf.train.latest_checkpoint(CHECKPOINT_PATH))

    # Here, inference works on every call.
    output_ckpt = model(image_infer)  # Get a result with shape: (4, 16) which is good.
    output_ckpt2 = model(image_infer)  # Get the same result: (4, 16)

    # Save to file. I have tried all different ways to do this..
    # tf.saved_model.save(model, os.path.join(OUTPUT_PATH, "saved_model"), signatures=concrete_fn)
    # tf.keras.models.save_model(model, os.path.join(OUTPUT_PATH, "saved_model"))
    # tf.saved_model.save(model, os.path.join(OUTPUT_PATH, "saved_model"))
    # All have the same issue, let's just use the simple method...
    model.save(OUTPUT_PATH)

    # But if we export the model (or make a concrete function), load it in and run twice,
    # the second call will return an empty output.
    model_loaded = tf.saved_model.load(OUTPUT_PATH)
    infer = model_loaded.signatures["serving_default"]
    output1 = infer(**{"input_image": tf.convert_to_tensor(image_infer)})  # Get a result with shape: (4, 16) which is good.
    output2 = infer(**{"input_image": tf.convert_to_tensor(image_infer)})  # Get a result like: (0, 16)


if __name__ == "__main__":
    main()
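A small helper can make this kind of "works once, fails on the second call" behaviour easier to catch. The following is only a sketch: `check_repeatable`, `output_shapes`, the `FlakyModel` stand-in, and the `"detections"` output name are all hypothetical and not part of the repo. It calls an inference function several times on the same input and reports which calls produced output shapes that differ from the first call.

```python
import numpy as np


def output_shapes(outputs):
    """Map each output name to its shape; accepts a dict of arrays or a single array."""
    if not isinstance(outputs, dict):
        outputs = {"output": outputs}
    return {name: tuple(np.asarray(value).shape) for name, value in outputs.items()}


def check_repeatable(infer_fn, inputs, n_calls=3):
    """Call infer_fn(inputs) n_calls times.

    Returns the indices (1-based after the first call) of calls whose
    output shapes differ from the shapes produced by the first call.
    """
    reference = output_shapes(infer_fn(inputs))
    mismatches = []
    for call_index in range(1, n_calls):
        if output_shapes(infer_fn(inputs)) != reference:
            mismatches.append(call_index)
    return mismatches


class FlakyModel:
    """Hypothetical stand-in: full output on the first call, empty afterwards."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        rows = 4 if self.calls == 1 else 0
        # "detections" is a made-up output name for illustration only.
        return {"detections": np.zeros((rows, 16), dtype=np.float32)}


image = np.zeros((1, 32, 32, 3), dtype=np.float32)
flaky_mismatches = check_repeatable(FlakyModel(), image)
stable_mismatches = check_repeatable(
    lambda x: {"detections": np.zeros((4, 16), dtype=np.float32)}, image
)
print(flaky_mismatches, stable_mismatches)
```

With the loaded signature from the script above, the same helper could presumably be used as `check_repeatable(lambda x: infer(input_image=x), tf.convert_to_tensor(image_infer))`, since eager tensors convert to NumPy arrays.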
The same behaviour happens if I make a concrete function and export like so:
concrete_fn = tf.function(model.call).get_concrete_function(
    tf.TensorSpec(
        shape=[None, None, None, 3], dtype=tf.float32, name="image_tensor"
    ),
    training=False
)
tf.saved_model.save(model, OUTPUT_PATH, signatures=concrete_fn)
I have a feeling this might have something to do with the code not allowing for a batch at this point in the decoding...
# only for batch size 1
preds = tf.concat(  # [bboxes, landms, landms_valid, conf]
    [bbox_regressions[0], landm_regressions[0],
     tf.ones_like(classifications[0, :, 0][..., tf.newaxis]),
     classifications[0, :, 1][..., tf.newaxis]], 1)
priors = prior_box_tf((tf.shape(inputs)[1], tf.shape(inputs)[2]),
                      cfg['min_sizes'], cfg['steps'], cfg['clip'])
decode_preds = decode_tf(preds, priors, cfg['variances'])
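For reference, here is a rough NumPy sketch of what a batched version of the `preds` concatenation above could look like. It is not the repo's code: `build_preds_batched` is a hypothetical name, and the channel sizes (4 bbox values, 10 landmark values, 2 class scores) are assumptions chosen to match the 16 columns seen in the output. The only change is keeping the batch axis instead of indexing `[0]`.

```python
import numpy as np


def build_preds_batched(bbox_regressions, landm_regressions, classifications):
    """Concatenate [bboxes, landms, landms_valid, conf] for every image in the batch."""
    # Keep the last axis with 0:1 / 1:2 slices instead of [..., newaxis].
    landms_valid = np.ones_like(classifications[..., 0:1])
    conf = classifications[..., 1:2]
    return np.concatenate(
        [bbox_regressions, landm_regressions, landms_valid, conf], axis=-1
    )


# Made-up shapes: 2 images, 5 priors each.
batch, num_priors = 2, 5
rng = np.random.default_rng(0)
bbox = rng.normal(size=(batch, num_priors, 4)).astype(np.float32)
landm = rng.normal(size=(batch, num_priors, 10)).astype(np.float32)
cls = rng.random(size=(batch, num_priors, 2)).astype(np.float32)

preds = build_preds_batched(bbox, landm, cls)

# Same computation as the batch-size-1 snippet, for the first image only.
preds_first = np.concatenate(
    [bbox[0], landm[0],
     np.ones_like(cls[0, :, 0][..., np.newaxis]),
     cls[0, :, 1][..., np.newaxis]], 1)

print(preds.shape)  # (batch, num_priors, 16)
```

The decoding against the priors and the NMS step would still need their own batched treatment; this only covers the concatenation.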
It's so weird. Has anyone experienced this? And has anyone been able to export the model and make it run consistently, or am I completely missing something? lol!
I'm thinking of rewriting the post-processing code to handle batching, but before I do that, I'm just checking whether anyone has been through this.
UPDATE!
Oh man, I fixed the issue. It turned out to be the custom BatchNormalization layer causing issues.
Probably something has changed since this repo was created - I'm using tf-2.8.
Here's how to fix it. In modules/models.py...
Update your ConvUnit layer to look like this (just use the built-in batch norm).
The training argument will be automatically handled by Keras during train or inference time.
class ConvUnit(tf.keras.layers.Layer):
    """Conv + BN + Act"""
    def __init__(self, f, k, s, wd, act=None, name='ConvBN', **kwargs):
        super(ConvUnit, self).__init__(name=name, **kwargs)
        self.conv = Conv2D(filters=f, kernel_size=k, strides=s, padding='same',
                           kernel_initializer=_kernel_init(),
                           kernel_regularizer=_regularizer(wd),
                           use_bias=False, name='conv')
        self.bn = tf.keras.layers.BatchNormalization(
            axis=-1,
            momentum=0.99,
            epsilon=1e-5,
            center=True,
            scale=True,
            name="bn"
        )

        if act is None:
            self.act_fn = tf.identity
        elif act == 'relu':
            self.act_fn = ReLU()
        elif act == 'lrelu':
            self.act_fn = LeakyReLU(0.1)
        else:
            raise NotImplementedError(
                'Activation function type {} is not recognized.'.format(act))

    def call(self, x, training=False):
        return self.act_fn(self.bn(self.conv(x), training=training))
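To illustrate what the `training` flag passed to the batch-norm layer controls, here is a simplified NumPy sketch of batch normalization. `TinyBatchNorm` is a made-up class for illustration only (no learned scale or offset), not Keras's implementation and not a diagnosis of what went wrong in the repo's custom layer. It shows the general idea: in training mode the layer normalizes with the current batch statistics and updates its moving averages, while in inference mode it only reads the stored moving averages, so repeated calls leave the state untouched.

```python
import numpy as np


class TinyBatchNorm:
    """Illustrative batch norm over the last axis (no learned scale/offset)."""

    def __init__(self, channels, momentum=0.99, epsilon=1e-5):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = np.zeros(channels)
        self.moving_var = np.ones(channels)

    def __call__(self, x, training=False):
        if training:
            # Use statistics of the current batch and update the moving averages.
            axes = tuple(range(x.ndim - 1))
            mean = x.mean(axis=axes)
            var = x.var(axis=axes)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference: read the stored statistics, do not modify any state.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)


x = np.random.default_rng(0).normal(loc=3.0, size=(1, 8, 8, 4))
bn = TinyBatchNorm(channels=4)

# Two inference calls on the same input give the same result.
first = bn(x, training=False)
second = bn(x, training=False)
inference_stable = np.allclose(first, second)

# A training-mode call changes the layer's internal state.
stats_before = bn.moving_mean.copy()
bn(x, training=True)
stats_changed = not np.allclose(stats_before, bn.moving_mean)

print(inference_stable, stats_changed)
```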
Just to be safe, I also added training=False to the call() method on each custom layer.
And now no more issues! Can be served in tf-serving.
Soon I will finish the batched implementation with tf.image.combined_non_max_suppression.
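As a rough idea of how `tf.image.combined_non_max_suppression` behaves on a batch, here is a minimal sketch with made-up boxes and scores (a single "face" class, normalized `[y1, x1, y2, x2]` coordinates). The thresholds reuse the `iou_th=0.4` and `score_th=0.5` values from the script above; the limits of 10 detections per image are arbitrary assumptions. This is not the planned implementation, just an illustration of the call.

```python
import tensorflow as tf

# Two images with the same three candidate boxes; boxes 0 and 1 overlap heavily.
boxes = tf.constant([
    [[0.10, 0.10, 0.50, 0.50], [0.12, 0.12, 0.52, 0.52], [0.60, 0.60, 0.90, 0.90]],
    [[0.10, 0.10, 0.50, 0.50], [0.12, 0.12, 0.52, 0.52], [0.60, 0.60, 0.90, 0.90]],
])  # shape (batch=2, num_boxes=3, 4)
scores = tf.constant([
    [0.9, 0.8, 0.7],
    [0.9, 0.3, 0.2],
])  # shape (batch=2, num_boxes=3)

nmsed_boxes, nmsed_scores, nmsed_classes, valid_detections = (
    tf.image.combined_non_max_suppression(
        boxes=boxes[:, :, tf.newaxis, :],   # (batch, num_boxes, 1, 4): boxes shared across classes
        scores=scores[..., tf.newaxis],     # (batch, num_boxes, num_classes=1)
        max_output_size_per_class=10,
        max_total_size=10,
        iou_threshold=0.4,
        score_threshold=0.5,
    )
)

# Outputs are padded to max_total_size; valid_detections gives the real count per image.
print(nmsed_boxes.shape, valid_detections.numpy())
```

Note that this op returns the selected boxes, scores, and classes but not the indices of the kept boxes, so carrying the landmark columns through the NMS step would need some extra handling.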