This repository addresses next speaker prediction on the EgoCom dataset using transformer-based models.
You can visit the dataset website for more details: [HERE]
The dataset is multi-modal and consists of features extracted from three modalities: text, video, and audio.
The aim of this project is to exploit all three modalities with transformer models to recognize the next speaker with higher accuracy.
Six methods are provided to investigate how different combinations of modalities and transformers perform, and each can be run with different settings (an illustrative sketch of the early- and late-fusion strategies follows the list below).
The provided methods are:
- Early Fusion Transformer (EFT)
- Late Fusion Transformer with Soft-Ranking (LFT+SR)
- Late Fusion Transformer with Softmax and Attention (LFT+SFA)
- Late Fusion Transformer with Multi-Head Attention Output and Attention Layer (LFT+AA)
- Multi-Stream Multi-Modal Hybrid Transformer (MSMMHT)
- Multi-Stream Multi-Modal Vanilla Transformer (MSMMVT)
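The main difference between the early- and late-fusion families above is where the modalities are combined. Below is a minimal PyTorch sketch of the two strategies; it is not the repository's implementation, and the feature dimensions, number of candidate speakers, and mean pooling are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

# Illustrative values only: the real EgoCom feature dimensions and the number
# of candidate speakers may differ.
D_TEXT, D_VIDEO, D_AUDIO, D_MODEL, N_SPEAKERS = 100, 512, 128, 256, 4

def encoder():
    layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

class EarlyFusion(nn.Module):
    """Concatenate per-timestep text/video/audio features, then run one transformer."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_TEXT + D_VIDEO + D_AUDIO, D_MODEL)
        self.enc = encoder()
        self.head = nn.Linear(D_MODEL, N_SPEAKERS)

    def forward(self, text, video, audio):          # each: (batch, time, dim)
        x = self.proj(torch.cat([text, video, audio], dim=-1))
        return self.head(self.enc(x).mean(dim=1))   # (batch, N_SPEAKERS) logits

class LateFusion(nn.Module):
    """One transformer per modality; per-modality predictions are combined afterwards."""
    def __init__(self):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, D_MODEL) for d in (D_TEXT, D_VIDEO, D_AUDIO))
        self.encs = nn.ModuleList(encoder() for _ in range(3))
        self.heads = nn.ModuleList(nn.Linear(D_MODEL, N_SPEAKERS) for _ in range(3))

    def forward(self, text, video, audio):
        logits = [h(e(p(x)).mean(dim=1))
                  for p, e, h, x in zip(self.projs, self.encs, self.heads, (text, video, audio))]
        # The late-fusion variants listed above differ in how these streams are
        # combined (soft-ranking, softmax + attention, multi-head attention, ...);
        # a plain average is used here purely for illustration.
        return torch.stack(logits).mean(dim=0)
```

In this sketch, each model maps a window of per-modality features to next-speaker logits; the soft-ranking and attention-based combinations named in the list would replace the plain averaging step of LateFusion.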
Before running the code, set the dataset path inside the Prepare_data.py script.
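For reference, the dataset location is usually exposed as a path variable near the top of such a script; the variable name below is only a hypothetical placeholder and may differ from the one actually used in Prepare_data.py.

```python
# Hypothetical excerpt of Prepare_data.py: the real variable name may differ.
# Point this at your local copy of the EgoCom features before running any method.
DATASET_DIR = "/path/to/EgoCom"
```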
Follow these commands to install Egocom-IRI-UPC in your environment. The required libraries are standard.
git clone https://github.com/mehdifatan/Egocom-IRI-UPC
cd Egocom-IRI-UPC
Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
Here is a sample usage of the scripts:
python LFT+AA.py --include-prior true --future-pred 5 --history-sec 4
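The other methods can presumably be run in the same way, assuming each one ships as a script named after its abbreviation and accepts the same flags (this naming is inferred from the example above and is not guaranteed for every script):
python EFT.py --include-prior true --future-pred 5 --history-sec 4
python MSMMHT.py --include-prior true --future-pred 5 --history-sec 4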