Yufeng Yin*, Jiashu Xu*, Tianxin Zu, and Mohammad Soleymani
Correspondence to:
- Yufeng Yin (yin@ict.usc.edu)
This is the official PyTorch implementation of X-Norm: Exchanging Normalization Parameters for Bimodal Fusion.
This repo contains the following methods for multimodal fusion:
- Early fusion
- Late fusion
- MISA [1]
- MulT [2]
- Gradient-Blending (GB) [3]
- X-Norm
We present X-Norm, a novel, simple, and efficient method for bimodal fusion. It generates and exchanges a limited but meaningful set of normalization parameters between the modalities, implicitly aligning their feature spaces.
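The idea of exchanging normalization parameters can be illustrated with a minimal sketch. This is not the code from this repo: it is a NumPy toy under the assumption that each modality is normalized and then scaled and shifted by parameters produced from the other modality. The names `normalize`, `make_generator`, and `exchange_norm`, the linear generator, and the feature sizes are all hypothetical.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization over the feature axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_generator(in_dim, out_dim, rng):
    """Hypothetical linear map from one modality's features to (scale, shift)."""
    w = rng.normal(scale=0.02, size=(in_dim, 2 * out_dim))
    b = np.zeros(2 * out_dim)

    def generate(features):
        out = features @ w + b
        delta_gamma, beta = np.split(out, 2, axis=-1)
        # Scale is parameterized as 1 + delta so it starts near the identity.
        return 1.0 + delta_gamma, beta

    return generate

def exchange_norm(x_rgb, x_flow, gen_from_rgb, gen_from_flow):
    """Normalize each modality, then modulate it with parameters from the other."""
    gamma_for_flow, beta_for_flow = gen_from_rgb(x_rgb)
    gamma_for_rgb, beta_for_rgb = gen_from_flow(x_flow)
    y_rgb = gamma_for_rgb * normalize(x_rgb) + beta_for_rgb
    y_flow = gamma_for_flow * normalize(x_flow) + beta_for_flow
    return y_rgb, y_flow

rng = np.random.default_rng(0)
x_rgb = rng.normal(size=(4, 16))   # assumed: batch of 4, 16 RGB features
x_flow = rng.normal(size=(4, 8))   # assumed: batch of 4, 8 flow features
gen_from_rgb = make_generator(16, 8, rng)
gen_from_flow = make_generator(8, 16, rng)
y_rgb, y_flow = exchange_norm(x_rgb, x_flow, gen_from_rgb, gen_from_flow)
print(y_rgb.shape, y_flow.shape)
```

Only the scale and shift vectors cross between the two branches in this sketch, which is what keeps the amount of exchanged information small.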
- Python 3.9
- PyTorch 1.11
- CUDA 10.1
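To compare a local setup against the versions listed above, a small check like the following may help. It is a sketch, not part of the repo; `check_environment` is a hypothetical helper, and it reports `None` for PyTorch and CUDA when PyTorch is not installed.

```python
import sys

def check_environment():
    """Return a dict describing the local Python / PyTorch / CUDA setup."""
    info = {"python": "{}.{}".format(*sys.version_info[:2])}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing more to report.
        info["torch"] = None
        info["cuda"] = None
        return info
    info["torch"] = torch.__version__
    # torch.version.cuda is the CUDA version PyTorch was built against.
    info["cuda"] = torch.version.cuda if torch.cuda.is_available() else None
    return info

print(check_environment())
```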
Step 1: Download the RGB and optical flow frames of kitchens P01, P08, and P22 from EPIC-KITCHENS-100 and put them into the data/epic_kitchens folder.
Step 2: Download the pretrained weights rgb_imagenet.pt and flow_imagenet.pt and put them into the checkpoints folder.
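Before training, it can be useful to confirm that the two steps above left the files where they are expected. The sketch below assumes the paths are relative to the repo root, exactly as named in the steps; `missing_paths` is a hypothetical helper and not part of the repo.

```python
from pathlib import Path

# Paths taken from the steps above, relative to the repo root (assumed).
EXPECTED = [
    Path("data/epic_kitchens"),
    Path("checkpoints/rgb_imagenet.pt"),
    Path("checkpoints/flow_imagenet.pt"),
]

def missing_paths(root="."):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p.as_posix() for p in EXPECTED if not (root / p).exists()]

missing = missing_paths()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Data and checkpoints are in place.")
```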
Unimodal methods (RGB or optical flow); pass one of the listed values to --fusion:
python main.py --fusion rgb/flow
Multimodal methods; pass one of the listed values to --fusion:
python main.py --fusion early/late/misa/mult/gb/xnorm
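To run several methods in sequence, the commands above can be generated from a list. This is a sketch that uses only the `--fusion` values shown above; `build_command` is a hypothetical helper, and nothing is assumed about other options of `main.py`.

```python
import sys

UNIMODAL = ["rgb", "flow"]
MULTIMODAL = ["early", "late", "misa", "mult", "gb", "xnorm"]

def build_command(fusion):
    """Build the argument list for one run of main.py."""
    if fusion not in UNIMODAL + MULTIMODAL:
        raise ValueError(f"unknown fusion method: {fusion}")
    return [sys.executable, "main.py", "--fusion", fusion]

# Print the commands; nothing is executed here.
for fusion in UNIMODAL + MULTIMODAL:
    print(" ".join(build_command(fusion)))
```

To execute a run instead of printing it, each list can be passed to `subprocess.run(cmd, check=True)` from the repo root.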
If you find this work or code helpful in your research, please cite:
@inproceedings{yin2022x,
title={X-Norm: Exchanging Normalization Parameters for Bimodal Fusion},
author={Yin, Yufeng and Xu, Jiashu and Zu, Tianxin and Soleymani, Mohammad},
booktitle={INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION},
pages={605--614},
year={2022}
}
[1] Hazarika, Devamanyu, Roger Zimmermann, and Soujanya Poria. "MISA: Modality-invariant and -specific representations for multimodal sentiment analysis." Proceedings of the 28th ACM International Conference on Multimedia. 2020.
[2] Tsai, Yao-Hung Hubert, et al. "Multimodal transformer for unaligned multimodal language sequences." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[3] Wang, Weiyao, Du Tran, and Matt Feiszli. "What makes training multi-modal classification networks hard?" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.