An innovative deep learning framework leveraging the CAT (Convolutions, Attention & Transformers) architecture to seamlessly integrate visual and textual modalities. The model uses CNNs for image feature extraction and Transformers for textual pattern recognition, fusing the two through attention for multimodal learning.
🐱 Dive into the world of CAT! Imagine if computers could understand and combine the essence of both pictures and words, just as we humans naturally do. By marrying the strengths of Convolutions (the magic behind image filters) and Transformers (the genius tech behind language models), the CAT framework stands as a bridge, seamlessly blending the visual and textual realms. So, whether you're marveling at a sunset photo or reading a poetic description, CAT seeks to decode, understand, and bring them together in harmony.
Looking for a swift kick-off? Explore our Jupyter Notebook directly in Google Colab!
✨ 1. Introduction
In this experimental work, we propose a model architecture that leverages Convolutional Neural Networks (CNNs) to extract salient features from images and Transformer-based models to capture intricate patterns in text. Termed the Convolutions, Attention & Transformers (CAT) framework, the architecture integrates attention mechanisms that act as an intermediate conduit, fusing the visual and textual modalities into a single representation.
Hmmm...NOT this 'CAT'.
This is my 'CAT'!
✨ 2. Hyperparameters of the optimal model
Architecture

Extractor

| Modality | Module       | Number of Unfrozen Blocks |
|----------|--------------|---------------------------|
| Image    | DenseNet-121 | 2                         |
| Text     | TinyBERT     | 2                         |
Parallelism

| Property        | Module                             | Number of Input Dimensions |
|-----------------|------------------------------------|----------------------------|
| Fully-connected | Batch Normalization, ReLU, Dropout | 896                        |
| Attention       |                                    |                            |
Classifier

| Property | Module | Number of Input Dimensions |
|----------|--------|----------------------------|
|          | Linear | 896 × 2                    |
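Putting the tables together, here is a rough PyTorch sketch of the fusion head they describe. The class name, argument names, and raw feature dimensions (1024 for DenseNet-121, 312 for a 4-layer TinyBERT) are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class CATFusion(nn.Module):
    """Sketch of the parallel fusion head: the real extractors
    (DenseNet-121 for images, TinyBERT for text) are abstracted
    here as pre-computed feature vectors."""

    def __init__(self, img_dim=1024, txt_dim=312, hidden=896, n_classes=1, p=0.3):
        super().__init__()
        # One parallel branch per modality: FC -> BatchNorm -> ReLU -> Dropout
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p),
        )
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p),
        )
        # Attention acts as the conduit between the two modalities
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # The classifier consumes the concatenated 896 x 2 vector
        self.classifier = nn.Linear(hidden * 2, n_classes)

    def forward(self, img_feat, txt_feat):
        img = self.img_branch(img_feat)            # (B, 896)
        txt = self.txt_branch(txt_feat)            # (B, 896)
        seq = torch.stack([img, txt], dim=1)       # (B, 2, 896)
        fused, _ = self.attn(seq, seq, seq)        # cross-modality attention
        return self.classifier(fused.flatten(1))   # (B, n_classes) logits
```

The two-element "sequence" fed to the attention layer lets each modality attend to the other before the concatenated representation reaches the linear classifier.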
Training procedure

| Class        | Details                   | Value                            |
|--------------|---------------------------|----------------------------------|
| Strategy     | Batch Size                | 16                               |
|              | Number of Epochs          | 50                               |
| Optimization | Loss Function             | Binary Cross Entropy With Logits |
|              | Optimizer                 | AdamW                            |
|              | Learning Rate             | 1e-5                             |
|              | Bias Correction           | False                            |
| Auxiliary    | Learning Rate Scheduler   | Linear                           |
|              | Number of Warmup Steps    | 0                                |
|              | Number of Training Steps  | Total Number of Batches          |
| Prediction   | Output Threshold          | 0.39                             |
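The training recipe above can be sketched as follows. The model, data, and step counts are placeholders; note that the "Bias Correction = False" setting corresponds to the `correct_bias` flag of Hugging Face's AdamW implementation, which torch's own AdamW does not expose:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the CAT classifier head
model = nn.Linear(896 * 2, 1)
criterion = nn.BCEWithLogitsLoss()   # Binary Cross Entropy With Logits
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Linear decay with 0 warmup steps; the number of training steps
# equals the total number of batches (illustrative counts here)
epochs, batches_per_epoch = 50, 100
total_steps = epochs * batches_per_epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / total_steps)
)

for _ in range(3):                               # tiny smoke loop on fake data
    x = torch.randn(16, 896 * 2)                 # batch size 16
    y = torch.randint(0, 2, (16, 1)).float()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

# At inference, probabilities above the tuned threshold become positives
probs = torch.sigmoid(model(torch.randn(4, 896 * 2)))
preds = (probs > 0.39).long()
```

The 0.39 output threshold replaces the default 0.5 cut on the sigmoid probabilities, which is typically tuned on a validation set.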
✨ 3. Data processing

💡 How do you process multimodal data? Good question!
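As a starting point, each modality gets its own preprocessing path before reaching its extractor. A minimal sketch, assuming standard ImageNet normalization for DenseNet-121 and a Hugging Face tokenizer for TinyBERT (both assumptions, not necessarily this repository's exact pipeline):

```python
import torch

# ImageNet channel statistics expected by pretrained DenseNet-121
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess_image(img: torch.Tensor) -> torch.Tensor:
    """img: float tensor in [0, 1] with shape (3, 224, 224)."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

# Text side: tokenize captions for TinyBERT, e.g. with Hugging Face:
#   tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
#   enc = tokenizer(caption, padding="max_length", truncation=True, max_length=64)
```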