This repository contains the official implementation of our CVPR 2024 Highlight paper Data-Efficient Multimodal Fusion on a Single GPU. We release code for the image-text setting, covering dataset downloading, feature extraction, fusion training, and evaluation. Our code is based on the LAVIS library.
- (Optional) Create a conda environment
conda create -n fusemix python=3.8
conda activate fusemix
- Build from source
git clone https://github.com/layer6ai-labs/fusemix
cd fusemix
pip install -e .
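As a quick sanity check of the editable install (a minimal sketch; the reported path depends on where you cloned the repository), you can confirm that the lavis package resolves to your local checkout:
python -c "import lavis; print(lavis.__file__)"
# Expected: a path inside your local fusemix/lavis directory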
The model zoo summarizes the supported models. To view it:
from lavis.models import model_zoo
print(model_zoo)
# ======================================================================
# Architectures                    Types
# ======================================================================
# dinov2_feature_extractor         vits14, vitb14, vitl14, vitg14
# bge_feature_extractor            large
# cohere_feature_extractor         v3
# mlp_contrastive_fusion           base
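As a minimal sketch, assuming these architectures are registered with LAVIS's standard load_model helper (the name and type below are taken from the model zoo table above), a model can be instantiated as follows:
import torch
from lavis.models import load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the DINOv2 ViT-g/14 image feature extractor listed in the model zoo
# (assumes this architecture is exposed through LAVIS's load_model registry)
model = load_model(
    name="dinov2_feature_extractor",
    model_type="vitg14",
    is_eval=True,
    device=device,
)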
The dataset zoo summarizes the supported datasets. To view it:
from lavis.datasets.builders import dataset_zoo
dataset_names = dataset_zoo.get_names()
print(dataset_names)
Please refer to lavis/datasets/download_scripts for scripts to download the required datasets.
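The download scripts are plain Python files. For example, assuming the upstream LAVIS naming (e.g. download_coco.py; other datasets follow the same pattern), COCO can be downloaded with:
cd lavis/datasets/download_scripts
python download_coco.py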
To extract features with a pretrained unimodal encoder (e.g., BGE-large text features from COCO captions):
bash run_scripts/feature_extract/feat_extract_bge_large_coco_cap.sh
To train a fusion adapter on pre-extracted features (e.g., DINOv2 ViT-g/14 image features and BGE-large text features on COCO, Visual Genome, SBU Captions, and CC3M):
bash run_scripts/fusion/mlp_contrastive_fusion_pretrain_dinov2_vitg14_bge_large_coco_vg_sbu_cap_cc3m.sh
To evaluate a trained fusion adapter on cross-modal retrieval (e.g., COCO image-text retrieval):
bash run_scripts/fusion/mlp_contrastive_fusion_retrieval_dinov2_vitg14_bge_large_coco.sh
If you find this work useful in your research, please cite the following paper:
@inproceedings{vouitsis2024dataefficient,
title={Data-Efficient Multimodal Fusion on a Single GPU},
author={No{\"e}l Vouitsis and Zhaoyan Liu and Satya Krishna Gorti and Valentin Villecroze and Jesse C. Cresswell and Guangwei Yu and Gabriel Loaiza-Ganem and Maksims Volkovs},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024},
}