
☕️ CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion


CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion

teaser image


Code structure

# CREMA code
./lavis/

# running scripts for CREMA training/inference
./run_scripts

Setup

Install Dependencies

  1. (Optional) Create a conda environment
conda create -n crema python=3.8
conda activate crema
  2. Build from source
pip install -e .

Download Models

Pre-trained Models

Visual Encoder: we adopt the pre-trained ViT-G (1B); the codebase downloads the model automatically.

Audio Encoder: we use the pre-trained BEATs (iter3+) model; please download it here and update the checkpoint path in the code.

3D Encoder: we perform off-line feature extraction following 3D-LLM; please refer to this page for the pre-extracted features and update the storage path in the dataset config.

Multimodal Q-Former: we initialize the query tokens and the FC layer of each MMQA in the Multimodal Q-Former from pre-trained BLIP-2 model checkpoints. We host the Multimodal Q-Former with pre-trained MMQA-audio and MMQA-3D on HuggingFace, and the Multimodal Q-Former initialized from BLIP-2 can be found here.
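
If you fetch the hosted checkpoints from the HuggingFace Hub programmatically, a download can look like the sketch below. The repo id and filename are placeholders, not the actual release names; please use the links above.

```python
# Minimal sketch: fetch a hosted Multimodal Q-Former checkpoint from the
# HuggingFace Hub and inspect it. The repo_id and filename are placeholders,
# not the actual release names -- use the links above instead.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<hf-user>/crema-multimodal-qformer",  # placeholder repo id
    filename="mmqa_audio.pth",                     # placeholder filename
)
state_dict = torch.load(ckpt_path, map_location="cpu")
print(list(state_dict.keys())[:10])  # e.g. MMQA query tokens / FC layer weights
```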

Fine-tuned Models

| Dataset | Modalities |
| --- | --- |
| SQA3D | Video+3D+Depth |
| SQA3D | Video+3D+Depth (espresso) |
| MUSIC-AVQA | Video+Audio+Flow |
| MUSIC-AVQA | Video+Audio+Flow (espresso) |
| NExT-QA | Video+Flow+Depth+Normal |
| NExT-QA | Video+Flow+Depth+Normal (espresso) |

Dataset Preparation & Feature Extraction

We test our model on SQA3D, MUSIC-AVQA, and NExT-QA.

We extract various extra modalities from raw video with pre-trained models; please refer to each model's repository and the paper appendix for more details.
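
As an illustration, per-frame depth maps can be extracted off-line with a monocular depth estimator. The sketch below uses MiDaS (DPT) via torch.hub as a stand-in, not necessarily the exact estimator used in the paper, and the file paths are placeholders.

```python
# Illustrative off-line depth extraction for one sampled video frame.
# MiDaS (DPT) via torch.hub is a stand-in estimator; paths are placeholders.
import cv2
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

frame = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(frame).to(device))                  # (1, H', W')
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=frame.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()

np.save("frame_0001_depth.npy", depth)  # cache per-frame features for the dataloader
```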

We will share extracted features in the following table soon.

| Dataset | Multimodal Features |
| --- | --- |
| SQA3D | Video Frames, Depth Map |
| MUSIC-AVQA | Video Frames, Optical Flow |
| NExT-QA | Video Frames, Depth Map, Optical Flow, Surface Normals |

We pre-train the MMQA modules in the CREMA framework with public modality-specific datasets:

Training and Inference

We provide CREMA training and inference script examples as follows.

1) Training

sh run_scripts/crema/finetune/sqa3d.sh

2) Inference

sh run_scripts/crema/inference/sqa3d.sh
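
The run scripts point to LAVIS-style YAML configs inside the codebase. If you need to adjust a hyperparameter (e.g., batch size) before launching, one option is to edit the config programmatically, as in the hedged sketch below; the config path and keys are assumptions about the project layout, so adjust them to the actual files.

```python
# Hedged sketch: tweak a fine-tuning config before running
# sh run_scripts/crema/finetune/sqa3d.sh. The config path and keys are
# assumptions about the LAVIS-style project layout -- adjust to the real files.
import yaml

cfg_path = "lavis/projects/crema/finetune/sqa3d.yaml"  # hypothetical path
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["run"]["batch_size_train"] = 8  # e.g. shrink the batch for a smaller GPU

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```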

Acknowledgments

We thank the developers of LAVIS, BLIP-2, CLIP, and X-InstructBLIP for their public code releases.

Reference

Please cite our paper if you use our models in your work:

@article{yu2024crema,
  title={CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion},
  author={Yu, Shoubin and Yoon, Jaehong and Bansal, Mohit},
  journal={arXiv preprint arXiv:2402.05889},
  year={2024}
}