An innovative deep learning framework leveraging the CAT (Convolutions, Attention & Transformers) architecture to seamlessly integrate visual and textual modalities. The model uses CNNs for image feature extraction and Transformers for textual pattern recognition, fusing the two through attention for multimodal learning.
🐱 Dive into the world of CAT! Imagine if computers could understand and combine the essence of both pictures and words, just as we humans naturally do. By marrying the strengths of Convolutions (the magic behind image filters) and Transformers (the genius tech behind language models), the CAT framework stands as a bridge, seamlessly blending the visual and textual realms. So, whether you're marveling at a sunset photo or reading a poetic description, CAT seeks to decode, understand, and bring them together in harmony.
Looking for a swift kick-off? Explore our Jupyter Notebook directly in Google Colab!
✨ 1. Introduction
In this experimental work, we propose a model architecture that leverages Convolutional Neural Networks (CNNs) to extract salient features from images and Transformer-based models to capture intricate patterns in text. Termed the Convolutions, Attention & Transformers (CAT) framework, the architecture integrates attention mechanisms that act as an intermediate conduit, fusing the visual and textual modalities into a single representation.
Hmmm...NOT this 'CAT'.
This is my 'CAT'!
✨ 2. Hyperparameters of the optimal model
Architecture

Extractor

| Modality | Module       | Number of Unfrozen Blocks |
|----------|--------------|---------------------------|
| Image    | DenseNet-121 | 2                         |
| Text     | TinyBERT     | 2                         |
Parallelism

| Property        | Module                             | Number of Input Dimensions |
|-----------------|------------------------------------|----------------------------|
| Fully-connected | Batch Normalization, ReLU, Dropout | 896                        |
| Attention       |                                    |                            |
Classifier

| Property | Module | Number of Input Dimensions |
|----------|--------|----------------------------|
|          | Linear | 896 × 2                    |
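Putting the tables together, here is a rough PyTorch sketch of the fusion head they describe. The class name, argument names, and raw feature dimensions (1024 for DenseNet-121, 312 for a 4-layer TinyBERT) are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class CATFusion(nn.Module):
    """Sketch of the parallel fusion head: the real extractors
    (DenseNet-121 for images, TinyBERT for text) are abstracted
    here as pre-computed feature vectors."""

    def __init__(self, img_dim=1024, txt_dim=312, hidden=896, n_classes=1, p=0.3):
        super().__init__()
        # One parallel branch per modality: FC -> BatchNorm -> ReLU -> Dropout
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p),
        )
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p),
        )
        # Attention acts as the conduit between the two modalities
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # The classifier consumes the concatenated 896 x 2 vector
        self.classifier = nn.Linear(hidden * 2, n_classes)

    def forward(self, img_feat, txt_feat):
        img = self.img_branch(img_feat)            # (B, 896)
        txt = self.txt_branch(txt_feat)            # (B, 896)
        seq = torch.stack([img, txt], dim=1)       # (B, 2, 896)
        fused, _ = self.attn(seq, seq, seq)        # cross-modality attention
        return self.classifier(fused.flatten(1))   # (B, n_classes) logits
```

The two-element "sequence" fed to the attention layer lets each modality attend to the other before the concatenated representation reaches the linear classifier.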
Training procedure

| Class        | Details                   | Value                            |
|--------------|---------------------------|----------------------------------|
| Strategy     | Batch Size                | 16                               |
|              | Number of Epochs          | 50                               |
| Optimization | Loss Function             | Binary Cross Entropy With Logits |
|              | Optimizer                 | AdamW                            |
|              | Learning Rate             | 1e-5                             |
|              | Bias Correction           | False                            |
| Auxiliary    | Learning Rate Scheduler   | Linear                           |
|              | Number of Warmup Steps    | 0                                |
|              | Number of Training Steps  | Total Number of Batches          |
| Prediction   | Output Threshold          | 0.39                             |
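The training recipe above can be sketched as follows. The model, data, and step counts are placeholders; note that the "Bias Correction = False" setting corresponds to the `correct_bias` flag of Hugging Face's AdamW implementation, which torch's own AdamW does not expose:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the CAT classifier head
model = nn.Linear(896 * 2, 1)
criterion = nn.BCEWithLogitsLoss()   # Binary Cross Entropy With Logits
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Linear decay with 0 warmup steps; the number of training steps
# equals the total number of batches (illustrative counts here)
epochs, batches_per_epoch = 50, 100
total_steps = epochs * batches_per_epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / total_steps)
)

for _ in range(3):                               # tiny smoke loop on fake data
    x = torch.randn(16, 896 * 2)                 # batch size 16
    y = torch.randint(0, 2, (16, 1)).float()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

# At inference, probabilities above the tuned threshold become positives
probs = torch.sigmoid(model(torch.randn(4, 896 * 2)))
preds = (probs > 0.39).long()
```

The 0.39 output threshold replaces the default 0.5 cut on the sigmoid probabilities, which is typically tuned on a validation set.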
✨ 3. Data processing

💡 How do you process multimodal data? Good question!
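As a starting point, each modality gets its own preprocessing path before reaching its extractor. A minimal sketch, assuming standard ImageNet normalization for DenseNet-121 and a Hugging Face tokenizer for TinyBERT (both assumptions, not necessarily this repository's exact pipeline):

```python
import torch

# ImageNet channel statistics expected by pretrained DenseNet-121
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess_image(img: torch.Tensor) -> torch.Tensor:
    """img: float tensor in [0, 1] with shape (3, 224, 224)."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

# Text side: tokenize captions for TinyBERT, e.g. with Hugging Face:
#   tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
#   enc = tokenizer(caption, padding="max_length", truncation=True, max_length=64)
```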