cross-modal-visual-sentiment-analysis

Code to reproduce "The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation". This project trains visual models for Sentiment Analysis in the Twitter domain using a cross-modal student-teacher approach.


The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation


[Figure: Method overview]

Introduction

This work introduced a cross-modal learning method to train visual models for Sentiment Analysis in the Twitter domain.

We used it to fine-tune the Vision Transformer (ViT) model pre-trained on ImageNet-21k, which achieved strong results on manually annotated external benchmarks, surpassing the previous state of the art.
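
For reference, a backbone like this can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch, not the repository's actual training code, and the checkpoint name is just one of the ViT variants listed in the results table.

```python
# Minimal sketch (not the repository's actual training code).
# Load a ViT B/32 backbone pre-trained on ImageNet-21k and attach a
# fresh 3-class head (negative / neutral / positive).
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch32-224-in21k",  # B/16, B/32, L/16, L/32 variants exist
    num_labels=3,
)
```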

We crawled ∼3.7M pictures from social media between 1 April and 30 June to use in our cross-modal approach. In particular, the cross-modal teacher-student learning technique removes the need for human annotators, minimizing the labeling effort required and allowing the creation of vast training sets.

Such large automatically labeled datasets can help future research train robust visual models, as the parameter counts of current state-of-the-art architectures keep growing, along with the amount of data they need to avoid overfitting.
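
To make the teacher-student idea concrete, the sketch below shows one distillation step under assumed names (`teacher`, `student`, and the batch variables are placeholders, not identifiers from this repository): the teacher labels the tweet text, and the visual student learns to predict those labels from the paired image alone.

```python
# Sketch of one cross-modal distillation step (all names are placeholders).
import torch
import torch.nn.functional as F

def train_step(student, teacher, images, texts, optimizer):
    with torch.no_grad():
        pseudo_labels = teacher(texts).argmax(dim=-1)  # teacher labels the text
    logits = student(images)                           # student sees only the image
    loss = F.cross_entropy(logits, pseudo_labels)      # distill text labels into vision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```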

Build Instructions

Clone codebase

$ git config --global http.postBuffer 1048576000
$ git clone --recursive https://github.com/fabiocarrara/cross-modal-visual-sentiment-analysis.git

Install dependencies

$ chmod +x install_dependencies.sh
$ ./install_dependencies.sh

How to use the scripts for benchmark evaluation

Test a model on a benchmark, get the accuracy, and save the predictions

$ python3 scripts/test_benchmark.py -m <model_name> -b <benchmark_name>

Options for <model_name>: [boosted_model, ViT_L16, ViT_L32, ViT_B16, ViT_B32, merged_T4SA, bal_flat_T4SA2.0, bal_T4SA2.0, unb_T4SA2.0, B-T4SA_1.0_upd_filt, B-T4SA_1.0_upd, B-T4SA_1.0]
Options for <benchmark_name>: [5agree, 4agree, 3agree, FI_complete, emotion_ROI_test, twitter_testing_2]
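
For example, to evaluate the boosted_model on the Twitter Dataset images that all five annotators agreed on:

$ python3 scripts/test_benchmark.py -m boosted_model -b 5agree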

Run a five-fold cross-validation on a benchmark, get the mean accuracy and standard deviation, and save the predictions (uses the boosted_model by default)

$ python3 scripts/5_fold_cross.py -b <benchmark_name>
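
For example, to cross-validate the default boosted_model on EmotionROI:

$ python3 scripts/5_fold_cross.py -b emotion_ROI_test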

Fine-tune on the five splits of FI, get the mean accuracy and standard deviation, and save the predictions (uses the boosted_model by default)

$ python3 scripts/fine_tune_FI.py

Trained Models

Pos/Neu/Neg give the confidence filter thresholds applied to the teacher's pseudo-labels; the last three columns give accuracy (%) on the Twitter Dataset (TD) at each annotator-agreement level.

| Model | Label Dataset | Pos | Neu | Neg | Student Arch | 5 agree | $\ge$ 4 agree | $\ge$ 3 agree |
|-----------|-----|------|------|------|------|------|------|------|
| Model 3.1 | A   | -    | -    | -    | B/32 | 82.2 | 78.0 | 75.5 |
| Model 3.2 | A   | 0.70 | 0.70 | 0.70 | B/32 | 84.7 | 79.7 | 76.6 |
| Model 3.3 | B   | 0.70 | 0.70 | 0.70 | B/32 | 82.3 | 78.7 | 75.3 |
| Model 3.4 | B   | 0.90 | 0.90 | 0.70 | B/32 | 84.4 | 80.3 | 77.1 |
| Model 3.5 | A+B | 0.90 | 0.90 | 0.70 | B/32 | 86.5 | 82.6 | 78.9 |
| Model 3.6 | A+B | 0.90 | 0.90 | 0.70 | L/32 | 85.0 | 82.4 | 79.4 |
| Model 3.7 | A+B | 0.90 | 0.90 | 0.70 | B/16 | 87.0 | 83.1 | 79.4 |
| Model 3.8 | A+B | 0.90 | 0.90 | 0.70 | L/16 | 87.8 | 84.8 | 81.9 |
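
The Pos/Neu/Neg thresholds filter the teacher's pseudo-labels: a sample is kept only if the teacher's confidence for its predicted class exceeds that class's threshold. Below is a minimal sketch of such a filter (function and variable names are illustrative, not from this repository):

```python
# Sketch of per-class confidence filtering (all names are illustrative).
import torch

def confidence_filter(probs, thresholds=(0.90, 0.90, 0.70)):
    """Keep samples whose predicted-class probability meets the per-class
    threshold; the (Pos, Neu, Neg) ordering here is an assumption."""
    conf, pred = probs.max(dim=-1)        # teacher confidence and pseudo-label
    thr = torch.tensor(thresholds)[pred]  # threshold for each predicted class
    keep = conf >= thr                    # boolean mask of retained samples
    return keep, pred
```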

Data

COMING SOON

BibTeX

@inproceedings{serra2023emotions,
    author    = {Serra, Alessio and Carrara, Fabio and Tesconi, Maurizio and Falchi, Fabrizio},
    title     = {The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation},
    booktitle = {ECAI},
    year      = {2023},
}