peteryuX/retinaface-tf2

Empty output on second inference call on exported saved model

louisquinn opened this issue · 1 comment

Hey everyone, and thanks @peteryuX for the great work!

I'm experiencing a weird issue where, after exporting the model to the saved_model format, the second inference call returns an empty output. The first inference call always works, though. I'm seeing this with both TensorFlow Serving and regular inference.

Here's how to reproduce...

import os
import tensorflow as tf

import cv2
import numpy as np

from modules.models import RetinaFaceModel
from modules.utils import set_memory_growth, load_yaml, draw_bbox_landm, pad_input_image, recover_pad_output

CONFIG_PATH = "<path-to>/configs/retinaface_res50.yaml"
CHECKPOINT_PATH = "<path-to>/retinaface-tf2/checkpoints/retinaface_res50"
OUTPUT_PATH = "<path-to>/retinaface-tf2/checkpoints/retinaface_res50_export"

def main():
    image = cv2.imread("an-image-path")
    image_infer = np.expand_dims(cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32), axis=0)

    config = load_yaml(CONFIG_PATH)
    model = RetinaFaceModel(config, training=False, iou_th=0.4, score_th=0.5)

    checkpoint = tf.train.Checkpoint(model=model)
    checkpoint.restore(tf.train.latest_checkpoint(CHECKPOINT_PATH))

    # Here, inference works on every call. 
    output_ckpt = model(image_infer)   # Get a result with shape: (4, 16) which is good. 
    output_ckpt2 = model(image_infer)  # Get the same result: (4, 16)

    # Save to file. I have tried several different ways to do this...
    # tf.saved_model.save(model, os.path.join(OUTPUT_PATH, "saved_model"), signatures=concrete_fn)
    # tf.keras.models.save_model(model, os.path.join(OUTPUT_PATH, "saved_model"))
    # tf.saved_model.save(model, os.path.join(OUTPUT_PATH, "saved_model"))
    # All have the same issue, so let's just use the simple method...
    model.save(OUTPUT_PATH)
    
    # But if we export the model (or make a concrete function), load it back in, and run it twice,
    # the second call returns an empty output.
    model_loaded = tf.saved_model.load(OUTPUT_PATH)
    infer = model_loaded.signatures["serving_default"]

    output1 = infer(input_image=tf.convert_to_tensor(image_infer))  # Result with shape (4, 16), which is good.
    output2 = infer(input_image=tf.convert_to_tensor(image_infer))  # Result with shape (0, 16) - empty!

if __name__ == "__main__":
    main()
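
For completeness, the same thing happens when the exported model is loaded into TensorFlow Serving. This is roughly the REST call I'm making (the model name "retinaface", the version directory, and the default port 8501 are just my local setup, not anything from the repo):

import json
import requests

# Assumes the exported saved_model was copied to <model_base_path>/retinaface/1
# and tensorflow/serving is running with its default REST port 8501.
url = "http://localhost:8501/v1/models/retinaface:predict"
payload = {"instances": image_infer.tolist()}  # same (1, H, W, 3) float32 image as above

response_1 = requests.post(url, data=json.dumps(payload))  # first call: non-empty predictions
response_2 = requests.post(url, data=json.dumps(payload))  # second call: comes back empty
print(response_1.json())
print(response_2.json())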

The same behaviour happens if I make a concrete function and export like so:

    concrete_fn = tf.function(model.call).get_concrete_function(
        tf.TensorSpec(
            shape=[None, None, None, 3], dtype=tf.float32, name="image_tensor"
        ),
        training=False
    )
    tf.saved_model.save(model, OUTPUT_PATH, signatures=concrete_fn)

I have a feeling this might have something to do with the code not allowing for a batch at this point in the decoding...

# only for batch size 1
preds = tf.concat(  # [bboxes, landms, landms_valid, conf]
    [bbox_regressions[0], landm_regressions[0],
     tf.ones_like(classifications[0, :, 0][..., tf.newaxis]),
     classifications[0, :, 1][..., tf.newaxis]], 1)
priors = prior_box_tf((tf.shape(inputs)[1], tf.shape(inputs)[2]),
                      cfg['min_sizes'],  cfg['steps'], cfg['clip'])
decode_preds = decode_tf(preds, priors, cfg['variances'])

It's so weird - has anyone experienced this? And has anyone been able to export the model and make it run consistently, or am I completely missing something, lol?

I'm thinking of rewriting the post-processing code to handle batching, but before I do that I'm just checking whether anyone has been through this.

UPDATE!

Oh man, I fixed the issue. It turned out to be the custom BatchNormalization layer causing the problem.
Something has probably changed in TensorFlow since this repo was created - I'm using tf-2.8.

Here's how to fix it. In modules/models.py...

Update your ConvUnit layer to look like this (just use the built-in batch norm).
The training argument will then be handled automatically by Keras at train and inference time.

class ConvUnit(tf.keras.layers.Layer):
    """Conv + BN + Act"""
    def __init__(self, f, k, s, wd, act=None, name='ConvBN', **kwargs):
        super(ConvUnit, self).__init__(name=name, **kwargs)
        self.conv = Conv2D(filters=f, kernel_size=k, strides=s, padding='same',
                           kernel_initializer=_kernel_init(),
                           kernel_regularizer=_regularizer(wd),
                           use_bias=False, name='conv')

        # Stock Keras BatchNormalization in place of the repo's custom layer.
        self.bn = tf.keras.layers.BatchNormalization(
            axis=-1,
            momentum=0.99,
            epsilon=1e-5,
            center=True,
            scale=True,
            name="bn"
        )

        if act is None:
            self.act_fn = tf.identity
        elif act == 'relu':
            self.act_fn = ReLU()
        elif act == 'lrelu':
            self.act_fn = LeakyReLU(0.1)
        else:
            raise NotImplementedError(
                'Activation function type {} is not recognized.'.format(act))

    def call(self, x, training=False):
        return self.act_fn(self.bn(self.conv(x), training=training))

Just to be safe, I also added training=False to the call() method of each custom layer.
And now, no more issues! The model can be served with TensorFlow Serving.
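
For illustration, "adding training=False" just means giving each custom layer's call() an explicit training argument and forwarding it down to its sub-layers. A minimal sketch (this layer is made up for the example, it is not the repo's exact code):

class ExampleHead(tf.keras.layers.Layer):
    """Example custom layer that forwards the training flag."""
    def __init__(self, f, wd, name='ExampleHead', **kwargs):
        super().__init__(name=name, **kwargs)
        self.conv_unit = ConvUnit(f=f, k=3, s=1, wd=wd, act='relu')

    def call(self, x, training=False):
        # The explicit default keeps BatchNormalization in inference mode inside
        # the exported SavedModel unless training=True is passed.
        return self.conv_unit(x, training=training)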

Soon I will finish the batched implementation with tf.image.combined_non_max_suppression.
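
Here's a rough sketch of the direction I have in mind (decode_batch is a hypothetical batched version of decode_tf; the thresholds mirror the ones passed to RetinaFaceModel above):

def batched_post_process(bbox_regressions, classifications, priors, variances,
                         iou_th=0.4, score_th=0.5, max_dets=100):
    # bbox_regressions: (batch, num_priors, 4), classifications: (batch, num_priors, 2)
    decoded = decode_batch(bbox_regressions, priors, variances)  # (batch, num_priors, 4)
    boxes = decoded[:, :, tf.newaxis, :]   # (batch, num_priors, 1, 4), boxes shared across classes
    scores = classifications[..., 1:]      # (batch, num_priors, 1), keep only the face score
    # Returns per-image padded boxes, scores, classes and a valid_detections count,
    # so the output shape no longer depends on the number of detections.
    return tf.image.combined_non_max_suppression(
        boxes, scores,
        max_output_size_per_class=max_dets,
        max_total_size=max_dets,
        iou_threshold=iou_th,
        score_threshold=score_th)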