This is the official implementation of Provable Dynamic Fusion for Low-Quality Multimodal Data (ICML 2023) by Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, and Xi Peng.
- This paper provides a theoretical framework for understanding the criterion of robust dynamic multimodal fusion.
- Building on this framework, a novel dynamic multimodal fusion method termed Quality-aware Multimodal Fusion (QMF) is proposed, with provably better generalization ability.
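The core mechanism can be illustrated with a small sketch: each modality's prediction is weighted by an estimate of its quality, so low-quality inputs contribute less to the fused decision. This is not the repository's implementation — the confidence score below is an energy-style logsumexp over logits, and the function names are illustrative; the actual QMF confidence estimation and training objective are defined in `train_qmf.py` and the paper:

```python
import numpy as np

def energy_confidence(logits):
    # Negative energy score: logsumexp over classes (numerically stabilized).
    # Higher value ~ the modality is more confident about this sample.
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def dynamic_fuse(logits_list):
    # logits_list: one (batch, num_classes) array per modality.
    # Per-sample weights come from a softmax over per-modality confidences,
    # so unreliable modalities are down-weighted dynamically.
    conf = np.stack([energy_confidence(l) for l in logits_list])   # (M, B)
    conf = conf - conf.max(axis=0, keepdims=True)                  # stabilize
    w = np.exp(conf) / np.exp(conf).sum(axis=0, keepdims=True)     # (M, B)
    return sum(w[m][:, None] * logits_list[m] for m in range(len(logits_list)))
```

With a confident text modality and a near-uninformative image modality, the fused logits stay close to the text prediction instead of being averaged down.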
```shell
pip install -r requirements.txt
```
Text-Image Classification:
Step 1: Download food101 and MVSA_Single and put them in the folder datasets.
Step 2: Prepare the train/dev/test splits jsonl files. We follow the MMBT settings and provide them in corresponding folders.
Step 3 (optional): To use GloVe embeddings with the BoW model, download glove.840B.300d.txt and put it in the folder datasets/glove_embeds. For the BERT model, download bert-base-uncased (Google Drive link) and put it in the root folder bert-base-uncased/.
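After these steps, the training scripts expect everything under `datasets/` plus the optional embedding/model folders. A small helper to sanity-check the layout — the jsonl file names here are assumptions based on the MMBT-style train/dev/test splits, so adjust them to whatever the loaders actually read:

```python
from pathlib import Path

# Assumed layout (illustrative; verify against the data loading code).
EXPECTED = [
    "datasets/food101/train.jsonl",
    "datasets/food101/dev.jsonl",
    "datasets/food101/test.jsonl",
    "datasets/MVSA_Single/train.jsonl",
    "datasets/MVSA_Single/dev.jsonl",
    "datasets/MVSA_Single/test.jsonl",
    "datasets/glove_embeds/glove.840B.300d.txt",  # only for BoW + GloVe
    "bert-base-uncased",                          # only for BERT-based models
]

def missing_paths(root="."):
    # Return the expected files/folders not present under root.
    return [p for p in EXPECTED if not (Path(root) / p).exists()]
```

Running `missing_paths()` from the repository root before training makes missing downloads obvious up front.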
RGBD Scene Recognition:
Step 1: Download NYUD2 and SUNRGBD and put them in the folder datasets.
The food101, MVSA_Single, NYUD2, and SUNRGBD datasets are also available via Baidu Netdisk.
We provide the trained models at Baidu Netdisk.
The pretrained BERT model is also available at Baidu Netdisk.
Note: Shell scripts for reference are provided in the folder shells.
To run our method on benchmark datasets:
```shell
task="MVSA_Single"         # or "food101"
task_type="classification"
model="latefusion"
name=$task"_"$model"model_run_df$i"

python train_qmf.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir ./saved/$task --name $name --data_path ./datasets/ \
    --task $task --task_type $task_type --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 \
    --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
```
To run TMC:
```shell
python train_tmc.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir ./saved/$task --name $name --data_path ./datasets/ \
    --task $task --task_type $task_type --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 \
    --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
```
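For context, TMC fuses per-modality Dirichlet opinions (belief masses plus an explicit uncertainty mass) with a reduced Dempster's rule. A minimal two-modality NumPy sketch, simplified from the TMC formulation — `train_tmc.py` is the authoritative implementation:

```python
import numpy as np

def opinion(evidence):
    # Map non-negative class evidence e_k to belief masses b_k and
    # uncertainty u, using Dirichlet strength S = sum(e) + K (one per class).
    K = evidence.shape[-1]
    S = evidence.sum() + K
    return evidence / S, K / S

def combine(b1, u1, b2, u2):
    # Reduced Dempster's rule: conflict C sums b1_i * b2_j over i != j,
    # and the combined masses are renormalized by 1 / (1 - C).
    conflict = np.outer(b1, b2).sum() - (b1 * b2).sum()
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale * (u1 * u2)
    return b, u
```

When two modalities agree, the combined opinion keeps the shared class dominant and its uncertainty drops below either modality's alone; the masses always renormalize so that sum(b) + u = 1.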
To run the other baselines:
```shell
task="MVSA_Single"         # or "food101"
task_type="classification"
model="bow"                # or "bert", "img", "concatbert", "concatbow", "mmbt"
name=$task"_"$model"model_run$i"

python train.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir ./saved/$task --name $name --data_path ./datasets/ \
    --task $task --task_type $task_type --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 \
    --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
```
If QMF or the idea of dynamic multimodal fusion is helpful in your research, please consider citing our paper:
```bibtex
@inproceedings{zhang2023provable,
  title={Provable Dynamic Fusion for Low-Quality Multimodal Data},
  author={Zhang, Qingyang and Wu, Haitao and Zhang, Changqing and Hu, Qinghua and Fu, Huazhu and Zhou, Joey Tianyi and Peng, Xi},
  booktitle={International Conference on Machine Learning},
  year={2023}
}
```
The code is inspired by TMC: Trusted Multi-View Classification and Confidence-Aware Learning for Deep Neural Networks.
There are many interesting works related to this paper:
- Uncertainty-based Fusion Network for Automatic Skin Lesion Diagnosis
- Uncertainty Estimation for Multi-view Data: The Power of Seeing the Whole Picture
- Reliable Multimodality Eye Disease Screening via Mixture of Student's t Distributions
- Trusted Multi-Scale Classification Framework for Whole Slide Image
- Fast Road Segmentation via Uncertainty-aware Symmetric Network
- Trustworthy multimodal regression with mixture of normal-inverse gamma distributions
- Uncertainty-Aware Multiview Deep Learning for Internet of Things Applications
- Automated crystal system identification from electron diffraction patterns using multiview opinion fusion machine learning
- Trustworthy Long-Tailed Classification
- Trusted multi-view deep learning with opinion aggregation
- EvidenceCap: Towards trustworthy medical image segmentation via evidential identity cap
- Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging
- Multimodal dynamics: Dynamical fusion for trustworthy multimodal classification
For any additional questions, feel free to email qingyangzhang@tju.edu.cn.