Residual Self-Attention Cross Fusion Network (RSA-CFN)


This repository contains the code for the paper "Residual Self-Attention Cross Fusion Network", submitted to the 2023 HumanUnderstanding AI Paper Competition.

Authors: @hyeonho1028, @zerojin91, @aibjw


Python 3.9.X

We use conda to manage the environment. You can create the environment by running the following command:

conda env create -f environment.yaml
conda activate rsa_cfn


To run the code, you need to download the dataset from here and put it in the data folder as KEMDy20_v1_1/.

The folder structure should be like this:

├── data                    # Data folder
│   └── data.csv              # Put the datafiles here
├── img                     # Images folder
│   ├── arch.png              # Model Arch. Structure
│   └── Image.png             # Other images
├── src                     # Source code
│   ├── data.py               # to load data
│   ├── layer.py              # DL Layer
│   ├── model.py              # Model Architecture
│   ├── trainer.py            # Trainer for training and validation
│   └── utils.py              # Utility Module
├── script                  # Script folder
│   └── exp_shell.sh          # Shell script for Experiment
├── requirements.yaml       # Environment setting
├── main.py                 # Main script to run the code
└── README.md               # Description of the project
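
Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the dataset folder `data/KEMDy20_v1_1` is taken from the download instructions and the documented `--data_path` default.

```python
from pathlib import Path


def missing_paths(root, data_dir="data/KEMDy20_v1_1"):
    """Return the expected files/folders that are absent under `root`.

    Hypothetical helper: the expected entries are copied from the folder
    tree in this README; the repository itself may not ship such a check.
    """
    root = Path(root)
    expected = [
        data_dir,
        "src/data.py",
        "src/layer.py",
        "src/model.py",
        "src/trainer.py",
        "src/utils.py",
        "main.py",
    ]
    return [rel for rel in expected if not (root / rel).exists()]


# Check the current working directory and report anything missing.
missing = missing_paths(".")
print("missing:", missing if missing else "nothing")
```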


python main.py --argument1 value1 --argument2 value2 ...

To run the main.py script, specify the following arguments:

Base Setting

| Argument | Description | Default |
|---|---|---|
| --data_path | Path to the data directory | 'data/KEMDy20_v1_1/' |
| --output_dir | Path to the output directory | 'models/' |
| --ver | Version name used for logging | 'baseline' |
| --audio_max_lens | Maximum audio length | 96000 |
| --text_max_lens | Maximum text length | 256 |
| --num_classes | Number of classes | 7 |
| --seed | Random seed | 42 |

Device Setting

| Argument | Description | Default |
|---|---|---|
| --device | Device type {cuda,cpu,mps} | 'cuda' |
| --use_amp | Use Automatic Mixed Precision for training | True |

Label Transform

| Argument | Description | Default |
|---|---|---|
| --label_transform | Perform label transformation | False |
| --batch_weighted_sampler | Use batch weighted sampler for training | False |

Training Setting

| Argument | Description | Default |
|---|---|---|
| --batch_size | Batch size for training | 16 |
| --val_batch_size | Batch size for validation | 32 |
| --lr | Learning rate | 1e-5 * |
| --epochs | Number of epochs for training | 10 |
| --weight_decay | Weight decay | 0.0 |

* When wav2vec 2.0 is used, the learning rate is set to 1e-4.

Training mode

| Argument | Description | Default |
|---|---|---|
| --use_wav | Use wav2vec features for training | False |
| --use_concat | Concatenate modalities for training | False |
| --multimodal_method | Multimodal method for training {early_fusion,late_fusion,mlp_mixer,stack,residual,rsa,rsa_cfn,hybrid_fusion} | 'early_fusion' |


| Argument | Description | Default |
|---|---|---|
| --wav_model | Pre-trained wav2vec model | 'facebook/wav2vec2-large-960h-lv60-self' |
| --pooling_mode | Pooling mode for feature extraction | 'mean' |
| --rnn_model | Pre-trained RNN model {xlm-roberta-base,xlm-roberta-large,klue/roberta-base,klue/roberta-large} | 'klue/roberta-base' |
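
For reference, the argument tables above can be expressed as an argparse parser. This is a sketch under assumptions, not the repository's actual main.py: the function name `build_parser` is hypothetical, and treating the boolean options as on/off flags (`--flag` / `--no-flag`, available from Python 3.9) is an assumption about how they are passed. Argument names, choices, and defaults are copied from the tables.

```python
import argparse

# Choices copied from the argument tables; everything else is illustrative.
FUSION_METHODS = [
    "early_fusion", "late_fusion", "mlp_mixer", "stack",
    "residual", "rsa", "rsa_cfn", "hybrid_fusion",
]
TEXT_MODELS = [
    "xlm-roberta-base", "xlm-roberta-large",
    "klue/roberta-base", "klue/roberta-large",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    p = argparse.ArgumentParser(description="RSA-CFN training (sketch)")
    # Assumption: boolean options behave as --flag / --no-flag switches.
    flag = argparse.BooleanOptionalAction

    # Base setting
    p.add_argument("--data_path", default="data/KEMDy20_v1_1/")
    p.add_argument("--output_dir", default="models/")
    p.add_argument("--ver", default="baseline")
    p.add_argument("--audio_max_lens", type=int, default=96000)
    p.add_argument("--text_max_lens", type=int, default=256)
    p.add_argument("--num_classes", type=int, default=7)
    p.add_argument("--seed", type=int, default=42)

    # Device setting
    p.add_argument("--device", choices=["cuda", "cpu", "mps"], default="cuda")
    p.add_argument("--use_amp", action=flag, default=True)

    # Label transform
    p.add_argument("--label_transform", action=flag, default=False)
    p.add_argument("--batch_weighted_sampler", action=flag, default=False)

    # Training setting
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--val_batch_size", type=int, default=32)
    p.add_argument("--lr", type=float, default=1e-5)
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--weight_decay", type=float, default=0.0)

    # Training mode
    p.add_argument("--use_wav", action=flag, default=False)
    p.add_argument("--use_concat", action=flag, default=False)
    p.add_argument("--multimodal_method", choices=FUSION_METHODS,
                   default="early_fusion")

    # Pre-trained models and pooling
    p.add_argument("--wav_model",
                   default="facebook/wav2vec2-large-960h-lv60-self")
    p.add_argument("--pooling_mode", default="mean")
    p.add_argument("--rnn_model", choices=TEXT_MODELS,
                   default="klue/roberta-base")
    return p


# Example: select RSA-CFN with label transformation, keep other defaults.
args = build_parser().parse_args(
    ["--multimodal_method", "rsa_cfn", "--label_transform"]
)
print(args.multimodal_method, args.label_transform, args.lr)
```

Check `python main.py --help` for the exact form the boolean options take in the real script.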


We provide various multimodal models.

You can select a model with the --multimodal_method argument. The following table shows the available models.

| Model | Fusion | Method |
|---|---|---|
| Early Fusion | Early | early_fusion |
| Late Fusion | Late | late_fusion |
| MLP-Mixer* | Early | mlp_mixer |
| Stack | Early | stack |
| Residual | Early | residual |
| RSA | Early | rsa |
| RSA-CFN | Early | rsa_cfn |
| Hybrid | Cross-Modality | hybrid_fusion |

* MLP-Mixer was the first-place winner of the 2022 HumanUnderstanding AI Paper Competition. We use the same architecture as the reference.
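
To make the Early/Late distinction in the table concrete, here is a minimal NumPy sketch of the two generic ideas: early fusion combines the modality features before a single classifier, while late fusion classifies each modality separately and combines the predictions. It is a textbook illustration, not the repository's layers. The feature sizes, random weights, and function names are assumptions; only the seven classes match the documented default.

```python
import numpy as np

rng = np.random.default_rng(42)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def early_fusion(audio_feat, text_feat, w):
    """Concatenate modality features, then apply one linear classifier."""
    fused = np.concatenate([audio_feat, text_feat], axis=-1)
    return softmax(fused @ w)


def late_fusion(audio_feat, text_feat, w_audio, w_text):
    """Classify each modality separately, then average the probabilities."""
    p_audio = softmax(audio_feat @ w_audio)
    p_text = softmax(text_feat @ w_text)
    return (p_audio + p_text) / 2


# Illustrative sizes (assumed): 4 samples, 8-dim audio and 6-dim text features.
batch, d_audio, d_text, num_classes = 4, 8, 6, 7
audio = rng.normal(size=(batch, d_audio))
text = rng.normal(size=(batch, d_text))

p_early = early_fusion(
    audio, text, rng.normal(size=(d_audio + d_text, num_classes))
)
p_late = late_fusion(
    audio, text,
    rng.normal(size=(d_audio, num_classes)),
    rng.normal(size=(d_text, num_classes)),
)
print(p_early.shape, p_late.shape)
```

In both cases each row of the output is a probability distribution over the classes; the difference is only where the modalities are combined.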

Please read the description in the scripts/exp_shell.sh script for more details. You can find the argument settings for each experiment in the script.


The following table shows the architecture of each model.

[Architecture diagrams: Late Fusion, Early Fusion, Stack, Residual]


Comparison Between PLMs

The following table shows the weighted F1-score of each PLM. The KLUE-RoBERTa model outperforms XLM-RoBERTa.

Comparison Between Architecture Fusion Method

The following table shows the weighted F1-score of each model.

RSA-CFN with label transform achieves the best score in this experiment.
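
For readers unfamiliar with the metric used in these comparisons, the sketch below computes a weighted F1-score in plain Python. It assumes the usual definition (per-class F1 averaged with weights proportional to each class's number of true samples); the repository may compute it with a library instead. The function name and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Classes that never occur in y_true have zero support and add nothing.
        score += support[label] / total * f1
    return score


# Toy example with three classes (labels are made up for illustration).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

In the toy example, classes 0 and 1 each have an F1 of 0.8 and class 2 has an F1 of 1.0, so the weighted result is (3 × 0.8 + 2 × 0.8 + 1 × 1.0) / 6 = 5/6.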

Multimodal Interaction

The image below shows the attention weights of each multimodal method.

Please read the paper for more details.


This repository is released under the Apache License 2.0. The license can be found in the LICENSE file.


  • Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations." Advances in Neural Information Processing Systems 33 (2020): 12449-12460.
  • Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
  • Conneau, Alexis, et al. "Unsupervised cross-lingual representation learning at scale." arXiv preprint arXiv:1911.02116 (2019).
  • wav2vec 2.0