☕️ CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion


Project Website arXiv HuggingFace

University of North Carolina at Chapel Hill


teaser image

🔥 News

  • Jun 14, 2024. Check out our new arXiv version 2 for exciting additions to CREMA:
    • A new modality-sequential modular training & modality-adaptive early exit strategy to handle learning with many modalities.
    • More unique/rare multimodal reasoning tasks (video-touch and video-thermal QA) to further demonstrate the generalizability of CREMA.

Code structure

# CREMA code
./lavis/

# running scripts for CREMA training/inference
./run_scripts

Setup

Install Dependencies

  1. (Optional) Create a conda environment
conda create -n crema python=3.8
conda activate crema
  2. Build from source
pip install -e .
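
(Optional) A quick sanity check that the editable install worked, assuming the CREMA codebase keeps LAVIS's model_zoo helper:

# Sanity check: the package should be importable from anywhere after `pip install -e .`
from lavis.models import model_zoo

# Prints the model architectures and types registered in this LAVIS-based codebase.
print(model_zoo)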

Download Models

Pre-trained Models

Visual Encoder: we adopt the pre-trained ViT-G (1B); the codebase downloads the model automatically.

Audio Encoder: we use the pre-trained BEATs (iter3+) model; please download it here and update its path in the code.
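
Before wiring the checkpoint in, you can confirm it loads; the path below is a placeholder for wherever you saved the BEATs (iter3+) file, and the exact code/config location that needs the path is in this repo:

import torch

# Placeholder: point this at your downloaded BEATs (iter3+) checkpoint file.
beats_ckpt = "/path/to/BEATs_iter3_plus.pt"

# Loading on CPU just verifies the file is a readable PyTorch checkpoint.
state = torch.load(beats_ckpt, map_location="cpu")
print(list(state.keys()))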

3D Encoder: we conduct offline feature extraction following 3D-LLM; please refer to this page for the pre-extracted features, and update the storage path in the dataset config.
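
As a small, purely illustrative check before editing the dataset config, you can confirm the downloaded 3D feature directory is where you expect (the directory name below is a placeholder):

from pathlib import Path

# Placeholder: directory holding the pre-extracted 3D features from the 3D-LLM page.
feature_root = Path("/path/to/3d_features")

# Count the files you are about to point the dataset config's storage field at.
num_files = sum(1 for p in feature_root.rglob("*") if p.is_file())
print(num_files, "feature files under", feature_root)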

Multimodal Q-Former: we initialize the query tokens and the FC layer of each MMQA in the Multimodal Q-Former from pre-trained BLIP-2 model checkpoints. We host the Multimodal Q-Former with pre-trained MMQA-audio and MMQA-3D on HuggingFace, and the Multimodal Q-Former initialized from BLIP-2 can be found here.
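
To fetch the released checkpoints from HuggingFace programmatically, huggingface_hub works; the repo id and filename below are placeholders, so substitute the actual ones linked above:

from huggingface_hub import hf_hub_download

# Placeholder repo id / filename: replace with the Multimodal Q-Former repo linked in this README.
ckpt_path = hf_hub_download(
    repo_id="<org>/<crema-multimodal-qformer>",   # hypothetical repo id
    filename="multimodal_qformer.pth",            # hypothetical checkpoint name
)
print("Checkpoint downloaded to:", ckpt_path)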

Fine-tuned Models

Dataset      Modalities
SQA3D        Video + 3D + Depth + Normal
MUSIC-AVQA   Video + Audio + Flow + Normal + Depth
NExT-QA      Video + Flow + Depth + Normal

Dataset Preparation & Feature Extraction

We test our model on SQA3D, MUSIC-AVQA, NExT-QA, Touch-QA, and Thermal-QA.

To obtain trimmed Touch-QA and Thermal-QA video frames, first download the raw videos from each original data project, set your custom data path, and then preprocess them with our scripts by running:

python trim_video.py

python decode_frames.py
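
The provided decode_frames.py handles frame decoding for our data layout; for reference only, a minimal OpenCV sketch of the same idea (placeholder paths, hypothetical sampling interval) looks like this:

import cv2
from pathlib import Path

video_path = "/path/to/trimmed_clip.mp4"   # placeholder trimmed video
out_dir = Path("/path/to/frames")          # placeholder output directory
out_dir.mkdir(parents=True, exist_ok=True)
every_n = 30                               # hypothetical: keep one frame out of every 30

cap = cv2.VideoCapture(video_path)
idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % every_n == 0:
        cv2.imwrite(str(out_dir / f"frame_{saved:05d}.jpg"), frame)
        saved += 1
    idx += 1
cap.release()
print(f"Saved {saved} frames to {out_dir}")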

We extract the various extra modalities from raw video with pre-trained models; please refer to each model's repo and the paper appendix for more details.
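
As one concrete illustration (not necessarily the exact experts used in the paper), per-frame depth maps can be extracted with an off-the-shelf monocular depth model such as MiDaS via torch.hub:

import cv2
import torch

# Illustrative only: the paper's actual depth/flow/normal experts may differ; see the appendix.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
midas.eval()

# Placeholder frame extracted in the previous step.
img = cv2.cvtColor(cv2.imread("/path/to/frames/frame_00000.jpg"), cv2.COLOR_BGR2RGB)
inp = midas_transforms.small_transform(img)

with torch.no_grad():
    depth = midas(inp)   # relative inverse depth, shape (1, H', W')

print("Depth map shape:", tuple(depth.shape))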

We will share extracted features in the following table.

Dataset      Multimodal Features
SQA3D        Video Frames, Depth Map, Surface Normals
MUSIC-AVQA   Video Frames, Optical Flow, Depth Map, Surface Normals
NExT-QA      Video Frames, Depth Map, Optical Flow, Surface Normals
Touch-QA     Video Frames, Surface Normals
Thermal-QA   Video Frames, Depth Map

We pre-train each MMQA in our CREMA framework with public modality-specific datasets.

Training and Inference

We provide CREMA training and inference script examples as follows.

1) Training

sh run_scripts/crema/finetune/sqa3d.sh

2) Inference

sh run_scripts/crema/inference/sqa3d.sh

Acknowledgments

We thank the developers of LAVIS, BLIP-2, CLIP, and X-InstructBLIP for their public code releases.

Reference

Please cite our paper if you use our models in your work:

@article{yu2024crema,
  title={CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion},
  author={Yu, Shoubin and Yoon, Jaehong and Bansal, Mohit},
  journal={arXiv preprint arXiv:2402.05889},
  year={2024}
}