google-research-datasets/recognizing-multimodal-entailment

Baseline model for the dataset

sayakpaul opened this issue · 2 comments

@cesar-ilharco hi. I am Sayak, an ML Engineer from India.

Firstly, thanks to you and the entire team for putting together such a comprehensive tutorial. I had the chance to go through the deck in detail last week and I really liked the materials presented in it.

To this end, I have spent the past week building some baseline models that may go well with the dataset. As a result, this blog post came out. The accompanying repository is here: https://github.com/sayakpaul/Multimodal-Entailment-Baseline.

The baseline model is simple: encode the images with a pre-trained ResNet50V2 and encode the text inputs with a pre-trained BERT (base). After extracting the encodings, project them into a unified embedding space, concatenate the projections, and finally pass them through a classification layer that predicts entailment/no-entailment/contradiction. In code, it looks like so (the full snippet can be found in the blog post):

import tensorflow as tf
from tensorflow import keras


def create_multimodal_model(
    num_projection_layers=1,
    projection_dims=256,
    dropout_rate=0.1,
    vision_trainable=False,
    text_trainable=False,
):
    # Receive the images as inputs.
    image_1 = keras.Input(shape=(128, 128, 3), name="image_1")
    image_2 = keras.Input(shape=(128, 128, 3), name="image_2")

    # Receive the text as inputs.
    bert_input_features = ["input_type_ids", "input_mask", "input_word_ids"]
    text_inputs = {
        feature: keras.Input(shape=(128,), dtype=tf.int32, name=feature)
        for feature in bert_input_features
    }

    # Create the encoders. `create_vision_encoder` and `create_text_encoder`
    # are defined in the full snippet in the blog post.
    vision_encoder = create_vision_encoder(
        num_projection_layers, projection_dims, dropout_rate, vision_trainable
    )
    text_encoder = create_text_encoder(
        num_projection_layers, projection_dims, dropout_rate, text_trainable
    )

    # Fetch the embedding projections.
    vision_projections = vision_encoder([image_1, image_2])
    text_projections = text_encoder(text_inputs)

    # Concatenate the projections and pass through the classification layer.
    concatenated = keras.layers.Concatenate()([vision_projections, text_projections])
    outputs = keras.layers.Dense(3, activation="softmax")(concatenated)
    return keras.Model([image_1, image_2, text_inputs], outputs)


multimodal_model = create_multimodal_model()
keras.utils.plot_model(multimodal_model, show_shapes=True)

Along with these, I also go over the following points:

  • A modality dropout trick to make the model robust to situations where a modality might be missing.
  • Cross-attention to help the model focus on the parts of the images that relate well to their textual counterparts.
  • Since the dataset suffers from class imbalance, two simple recipes to mitigate the problem (a minimal sketch of the first recipe follows this list):
    • Loss-weighted training
    • Focal loss
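
To make the class-imbalance recipe concrete, here is a minimal sketch of loss-weighted (per-class) training with Keras. The class counts, the `train_ds`/`val_ds` names, and the hyperparameters are placeholders rather than values from the blog post, and the exact recipe there may differ:

import numpy as np

# Hypothetical class counts for entailment / no-entailment / contradiction;
# replace these with the actual counts from the training split.
class_counts = np.array([120, 800, 80], dtype=np.float32)

# Inverse-frequency weights, normalized so the average weight is 1.
class_weight = {
    i: float(class_counts.sum() / (len(class_counts) * count))
    for i, count in enumerate(class_counts)
}

multimodal_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # assumes integer labels
    metrics=["accuracy"],
)

# `class_weight` scales each example's contribution to the loss by its class,
# which weights the minority classes up during training.
multimodal_model.fit(
    train_ds,  # placeholder tf.data.Dataset yielding (inputs, integer label) pairs
    validation_data=val_ds,
    class_weight=class_weight,
    epochs=10,
)

Focal loss would instead replace the loss function itself, down-weighting well-classified examples.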

I hope the community will benefit from these and that they serve as a simple baseline to foster research in the area. Do you think it makes sense to mention all of this on the tutorial website and in this repository? I totally understand if not.

On a related note, I also want to use the implicit similarity signals between the examples to further regularize the training. This is doable with Neural Structured Learning, and I believe @arjung has already presented a demo of it in the tutorial. It'd be great to collaborate on a tutorial around this that the community could readily use. So, open to ideas here :)
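
To sketch what that could look like, here is a rough, assumption-laden illustration of wrapping the baseline model with NSL's graph regularization. The configuration values are placeholders, and the training data would first have to be augmented with graph neighbors (e.g. via nsl.tools.pack_nbrs), so treat this as an outline of the API rather than a working recipe:

import neural_structured_learning as nsl

# Placeholder neighbor configuration; the right values depend on how the
# similarity graph over examples is built.
graph_reg_config = nsl.configs.make_graph_reg_config(
    max_neighbors=2,   # neighbors consumed per example
    multiplier=0.1,    # weight of the graph-regularization term in the loss
    distance_type=nsl.configs.DistanceType.L2,
)

# Wrap the Keras baseline; it then trains on examples augmented with the
# features of their graph neighbors.
graph_reg_model = nsl.keras.GraphRegularization(multimodal_model, graph_reg_config)
graph_reg_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)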

Hi @sayakpaul, this is great, thank you for your contribution. I agree the community can benefit from it.
The code is clear and the Keras blog post is well documented. Happy to; I've linked them on the https://multimodal-entailment.github.io/ website and in this repository's README.

Right, Arjun presented a demo in the tutorial; the Colab is available here: https://colab.research.google.com/github/tensorflow/neural-structured-learning/blob/master/workshops/kdd_2020/graph_regularization_pheme_natural_graph.ipynb

Open to ideas and suggestions too

Thank you @cesar-ilharco!

That's an amazing tutorial, especially because it explicitly shows how to create neighbors in a format that is compatible with NSL's graph regularization.

Open to ideas and suggestions too

I was thinking along the lines of using both modalities from the multimodal dataset and incorporating graph regularization to build an entailment model. Maybe the NSL team already has something planned; that's why I tagged Arjun in my previous comment.