This is the official implementation of Provable Dynamic Fusion for Low-Quality Multimodal Data (ICML 2023) by Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, and Xi Peng.
- This paper provides a theoretical framework for understanding the criterion of robust dynamic multimodal fusion.
- Building on this framework, a novel dynamic multimodal fusion method termed Quality-aware Multimodal Fusion (QMF) is proposed, with provably better generalization ability.
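The core mechanism can be illustrated with a small sketch: each modality's prediction is weighted by an estimate of its quality, so low-quality inputs contribute less to the fused decision. This is not the repository's implementation — the confidence score below is an energy-style logsumexp over logits, and the function names are illustrative; the actual QMF confidence estimation and training objective are defined in `train_qmf.py` and the paper:

```python
import numpy as np

def energy_confidence(logits):
    # Negative energy score: logsumexp over classes (numerically stabilized).
    # Higher value ~ the modality is more confident about this sample.
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def dynamic_fuse(logits_list):
    # logits_list: one (batch, num_classes) array per modality.
    # Per-sample weights come from a softmax over per-modality confidences,
    # so unreliable modalities are down-weighted dynamically.
    conf = np.stack([energy_confidence(l) for l in logits_list])   # (M, B)
    conf = conf - conf.max(axis=0, keepdims=True)                  # stabilize
    w = np.exp(conf) / np.exp(conf).sum(axis=0, keepdims=True)     # (M, B)
    return sum(w[m][:, None] * logits_list[m] for m in range(len(logits_list)))
```

With a confident text modality and a near-uninformative image modality, the fused logits stay close to the text prediction instead of being averaged down.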
```shell
pip install -r requirements.txt
```
Text-Image Classification:
Step 1: Download food101 and MVSA_Single and put them in the folder datasets.
Step 2: Prepare the train/dev/test splits jsonl files. We follow the MMBT settings and provide them in corresponding folders.
Step 3 (optional): To use GloVe embeddings with the BoW model, download glove.840B.300d.txt and put it in the folder datasets/glove_embeds. For the BERT model, download bert-base-uncased (Google Drive link) and put it in the root folder bert-base-uncased/.
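After these steps, the training scripts expect everything under `datasets/` plus the optional embedding/model folders. A small helper to sanity-check the layout — the jsonl file names here are assumptions based on the MMBT-style train/dev/test splits, so adjust them to whatever the loaders actually read:

```python
from pathlib import Path

# Assumed layout (illustrative; verify against the data loading code).
EXPECTED = [
    "datasets/food101/train.jsonl",
    "datasets/food101/dev.jsonl",
    "datasets/food101/test.jsonl",
    "datasets/MVSA_Single/train.jsonl",
    "datasets/MVSA_Single/dev.jsonl",
    "datasets/MVSA_Single/test.jsonl",
    "datasets/glove_embeds/glove.840B.300d.txt",  # only for BoW + GloVe
    "bert-base-uncased",                          # only for BERT-based models
]

def missing_paths(root="."):
    # Return the expected files/folders not present under root.
    return [p for p in EXPECTED if not (Path(root) / p).exists()]
```

Running `missing_paths()` from the repository root before training makes missing downloads obvious up front.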
RGBD Scene Recognition:
Step 1: Download NYUD2 and SUNRGBD and put them in the folder datasets.
The food101, MVSA_Single, NYUD2, and SUNRGBD datasets are also available via Baidu Netdisk.
We provide the trained models at Baidu Netdisk.
The pretrained BERT model is also available at Baidu Netdisk.
Note: Shell scripts for reference are provided in the folder shells.
To run our method on benchmark datasets:
```shell
task="MVSA_Single"         # or "food101"
task_type="classification"
model="latefusion"
name=$task"_"$model"model_run_df$i"

python train_qmf.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir ./saved/$task --name $name --data_path ./datasets/ \
    --task $task --task_type $task_type --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 \
    --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
```
To run TMC:
```shell
python train_tmc.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir ./saved/$task --name $name --data_path ./datasets/ \
    --task $task --task_type $task_type --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 \
    --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
```
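For context, TMC fuses per-modality Dirichlet opinions (belief masses plus an explicit uncertainty mass) with a reduced Dempster's rule. A minimal two-modality NumPy sketch, simplified from the TMC formulation — `train_tmc.py` is the authoritative implementation:

```python
import numpy as np

def opinion(evidence):
    # Map non-negative class evidence e_k to belief masses b_k and
    # uncertainty u, using Dirichlet strength S = sum(e) + K (one per class).
    K = evidence.shape[-1]
    S = evidence.sum() + K
    return evidence / S, K / S

def combine(b1, u1, b2, u2):
    # Reduced Dempster's rule: conflict C sums b1_i * b2_j over i != j,
    # and the combined masses are renormalized by 1 / (1 - C).
    conflict = np.outer(b1, b2).sum() - (b1 * b2).sum()
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale * (u1 * u2)
    return b, u
```

When two modalities agree, the combined opinion keeps the shared class dominant and its uncertainty drops below either modality's alone; the masses always renormalize so that sum(b) + u = 1.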
To run the other baselines:
```shell
task="MVSA_Single"         # or "food101"
task_type="classification"
model="bow"                # or "bert", "img", "concatbert", "concatbow", "mmbt"
name=$task"_"$model"model_run$i"

python train.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir ./saved/$task --name $name --data_path ./datasets/ \
    --task $task --task_type $task_type --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 \
    --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
```
If QMF or the idea of dynamic multimodal fusion is helpful in your research, please consider citing our paper:
```bibtex
@inproceedings{zhang2023provable,
  title={Provable Dynamic Fusion for Low-Quality Multimodal Data},
  author={Zhang, Qingyang and Wu, Haitao and Zhang, Changqing and Hu, Qinghua and Fu, Huazhu and Zhou, Joey Tianyi and Peng, Xi},
  booktitle={International Conference on Machine Learning},
  year={2023}
}
```
The code is inspired by TMC: Trusted Multi-View Classification and Confidence-Aware Learning for Deep Neural Networks.
There are many interesting works related to this paper:
- Uncertainty-based Fusion Network for Automatic Skin Lesion Diagnosis
- Uncertainty Estimation for Multi-view Data: The Power of Seeing the Whole Picture
- Reliable Multimodality Eye Disease Screening via Mixture of Student's t Distributions
- Trusted Multi-Scale Classification Framework for Whole Slide Image
- Fast Road Segmentation via Uncertainty-aware Symmetric Network
- Trustworthy multimodal regression with mixture of normal-inverse gamma distributions
- Uncertainty-Aware Multiview Deep Learning for Internet of Things Applications
- Automated crystal system identification from electron diffraction patterns using multiview opinion fusion machine learning
- Trustworthy Long-Tailed Classification
- Trusted multi-view deep learning with opinion aggregation
- EvidenceCap: Towards trustworthy medical image segmentation via evidential identity cap
- Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging
- Multimodal dynamics: Dynamical fusion for trustworthy multimodal classification
For any additional questions, feel free to email qingyangzhang@tju.edu.cn.